Maximum Caliber Can Characterize Genetic Switches with Multiple

Feb 6, 2018 - Molecular and Cellular Biophysics, University of Denver, Denver, Colorado 80209, United States. ‡ Department of ... The observed prote...
1 downloads 12 Views 1MB Size
Article Cite This: J. Phys. Chem. B XXXX, XXX, XXX−XXX

pubs.acs.org/JPCB

Maximum Caliber Can Characterize Genetic Switches with Multiple Hidden Species Taylor Firman,† Stephen Wedekind,‡ T. J. McMorrow,‡ and Kingshuk Ghosh*,‡ †

Molecular and Cellular Biophysics, University of Denver, Denver, Colorado 80209, United States Department of Physics and Astronomy, University of Denver, Denver, Colorado 80209, United States



ABSTRACT: Gene networks with feedback often involve interactions between multiple species of biomolecules, much more than experiments can actually monitor. Coupled with this is the challenge that experiments often measure gene expression in noisy fluorescence instead of protein numbers. How do we infer biophysical information and characterize the underlying circuits from this limited and convoluted data? We address this by building stochastic models using the principle of Maximum Caliber (MaxCal). MaxCal uses the basic information on synthesis, degradation, and feedbackwithout invoking any other auxiliary species and ad hoc reactionsto generate stochastic trajectories similar to those typically measured in experiments. MaxCal in conjunction with Maximum Likelihood (ML) can infer parameters of the model using fluctuating trajectories of protein expression over time. We demonstrate the success of the MaxCal + ML methodology using synthetic data generated from known circuits of different genetic switches: (i) a single-gene autoactivating circuit involving five species (including mRNA), (ii) a mutually repressing two-gene circuit (toggle switch) with seven species (including mRNA) considering stochastic time traces of two proteins, and (iii) the same toggle switch circuit considering stochastic time traces of only one of the two proteins. To further challenge the MaxCal + ML inference scheme, we repeat our analysis for the second and third scenario with traces expressed in noisy fluorescence instead of protein number to closely mimic typical experiments. We show that, for all of these models with increasing complexity and obfuscation, the minimal model of MaxCal is still able to capture the fluctuations of the trajectory and infer basic underlying rate parameters when benchmarked against the known values used to generate the synthetic data. Importantly, the model also yields an effective feedback parameter that can be used to quantify interactions within these circuits. These applications show the promise of MaxCal’s ability to characterize circuits with limited data, and its utility to better understand evolution and advance design strategies for specific functions.



expressed proteins using fluorescent tags. Other underlying molecular actors such as the promoter, mRNA, promoterprotein complexes, protein-protein complexes, etc., are not usually monitored. This poses a fundamental problem in quantitative model building. In spite of the limitation of partial data, protein expression profiles monitored over a long period of time encode valuable information about the entire system. The observed protein expression time traces fluctuate significantly due to the small number of molecules involved in the process.7−15 The intricate details of the fluctuation profiles directly depend on the underlying circuit topology. This provides a powerful avenue to reverse engineer and infer network details by carefully studying these noisy protein expression levels.16−21 Mass-action type models only describe averages.22 Consequently, they can be used to fit only the first moment of the observables, instead of

INTRODUCTION Much of biology is regulated by naturally occurring genetic circuits involving feedback. Synthetic biologists are also designing small genetic circuits to achieve particular functions.1−5 One common motifboth in natural and synthetic biologyis positive feedback circuits that behave like switches, exhibiting bimodal protein concentration profiles with protein levels alternating between high (H) and low (L) copy number states. Understanding quantitative features of these types of circuits is important when regulating function in circuit design. For example, knowledge about feedback strength and its dependence on tunable parameters is essential to initiate and control the toggling between H and L states. Bimodality and switching between the two states is essential to adopt bethedging strategies in microbes.5 As microbes evolve under different stressors, they may adjust their strategy and utilize features of bimodality by tuning the feedback. Negative feedback circuits are also critical in controlling characteristics of gene expression.6 However, quantification of feedback and other attributes of the circuit is a challenge due to limited information. Most experiments typically only follow a few © XXXX American Chemical Society

Special Issue: Ken A. Dill Festschrift Received: December 13, 2017 Revised: January 31, 2018 Published: February 6, 2018 A

DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B the actual stochastic trajectory itself. This leads to data reduction and underutilizes the full information content of the measured data. Thus, mass-action models are not suited to learn from the noise profile of the trajectory. Stochastic models of gene expression are needed to harness the hidden information from fluctuating time traces. However, traditional stochastic approaches are not without limitations either. The most common practice of stochastic modeling is “bottom-up”: starting with defined sets of reactions with multiple rate parameters and using the Gillespie algorithm23 or related methods to simulate the system. Parameters are then varied to fit the path distribution or different observables. These models invoke auxiliary species and reactions between them to mimic feedback. Naturally, the knowledge about these hidden species and underlying reaction networks are rarely known a priori. Such ad hoc assumptions can lead to overparameterization, a typical problem in modeling gene networks.24 Models with too many parameters are also problematic for efficient parameter searches17 and circuit design. We recently proposed an alternate “top-down” approach in modeling fluctuations in these networks using the principle of Maximum Caliber (MaxCal).25−27 MaxCal is a variational principlesimilar to Maximum Entropy (MaxEnt)used to predict trajectory probabilities by maximizing the path entropy subject to constraints. The primary advantage of MaxCal is twofold: (i) it utilizes a systematic approach to build models of increasing complexity using the minimal set of constraints without assuming any particular reaction network, and (ii) it yields path probability distributions that can be used to directly compute the likelihood of the observed trajectory without any data reduction. This lends itself well to Maximum Likelihood (ML) estimation to objectively determine the model parameters. In addition, MaxCal offers a framework to circumvent a frequently encountered problem in analyzing experimental trajectories that are typically measured in fluorescence rather than protein numbers. The amount of fluorescence measured per protein is noisy, constantly fluctuating around an average value. This poses a major challenge in deconvoluting the fluorescence fluctuation from the protein number fluctuation. However, the trajectory-based approach of MaxCal naturally incorporates the fluorescence distribution when computing the likelihood of an observed trajectory reported in fluorescence.25 Path entropy maximization has also been applied to analyze single molecule and biomolecular simulation data.28−31 In our recent work, we successfully demonstrated these features of MaxCal on a single-gene autoactivating (SGAA) circuit with synthetic time traces generated from a known reaction network with four species: activated and inactivated promoter, protein monomer, and protein dimer. We showed that it is possible to infer underlying rate parameters with the stochastic trajectory of only one of these species, specifically the information about the protein monomer. For simplicity, we ignored the presence of mRNA when generating the synthetic data. New questions now emerge: How likely are we to succeed with MaxCal when the number of hidden species increases further? For example, can we model the same single-gene autoactivating circuit upon including mRNA in the reaction network? Can we even describe more complex circuits with multiple genes expressing multiple proteins that interact with each other? To test the scalability of MaxCal to describe systems with increasing numbers of hidden species, we generate stochastic time trajectories for four specific models. In the first

model (M1), we generate stochastic time traces for SGAA while explicitly considering mRNA. This is a natural extension of our previous work.25 The second model (M2) simulates positive feedback in a two-gene mutually repressing circuit known as a toggle switch (TS)1,26,32with mRNA dynamics included and provides the fluctuating time traces of both proteins for use in the MaxCal inference machinery. In the third model (M3), we test MaxCal’s ability to infer underlying rate parameters when the stochastic trajectories in M2 are given in fluorescence rather than protein number, mimicking experimental conditions. In the fourth model (M4), we use the same TS circuit but with the temporal information on only one protein (expressed in protein number). Thus, M4 strongly tests the scalability of MaxCal inference by using the same TS circuit but providing information on only one of the proteins with six other species remaining hidden. In the fifth model (M5), we further challenge MaxCal by giving it the stochastic trajectory from M4 in fluorescence rather than protein number. The underlying parameters used to generate these synthetic time traces, thereby considered as “true” values, are compared against MaxCal inferred parameters for benchmarking. We find that, in all five cases, MaxCal performs reasonably well. This demonstrates the potential of MaxCal to infer information about network details that are not directly measurable in experiments, even when there are many hidden species compared to available data and even when they are further convolved with fluorescence noise. In the next section, we describe the details of the underlying reaction networks as well as our methodology for different models.



MATERIALS AND METHODS Generating Synthetic Time Traces for SGAA. Synthetic biologists are building small motifs to achieve specific functions and test different hypotheses.3,5,6,33−37 Positive feedback is frequently employed to achieve switch-like behavior in these applications. We recently demonstrated MaxCal’s ability to infer underlying circuit details using the simplest motif of a single gene autoactivating (SGAA) circuit.25 Among several models of autoactivation in different biological contexts,38−45 we adopted the one studied by Elston et al.46 to generate stochastic time traces to serve as a proxy for experimental data. For simplicity, we ignored the presence of mRNA that typically influences protein expression.25 In the present work, we use the M1 model of SGAA that explicitly accounts for mRNA when generating synthetic data. Specifically, we use the following reaction scheme g

α → α + a; p

a → a + A;

d

a→⌀ r

A→⌀

fd

A + A ⇄ A2 bd

fp

α + A 2 ⇄ α* ; bp

g*

α* → α* + a (1)

where some generic mRNA a is created from its corresponding gene α at a rate of g and is degraded at a rate of d. This mRNA produces the protein it encodes (A) at a rate of p. Protein A can then degrade at a rate of r or dimerize into A2 with forward and backward rates of fd and bd, respectively. To incorporate feedback, A2 binds and unbinds from its promoter at rates of f p B

DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B ⎛ g ⎞ ⟨lα⟩NL peff = p⟨a⟩NL ≈ p⎜ ⎟ ≈ , ⎝d⎠ Δt ⎛ g * ⎞ ⟨l ⟩ * = p⟨a⟩N ≈ p⎜ ⎟ ≈ α NH , peff H ⎝d ⎠ Δt

and bp, respectively, sending the promoter into and out of its activated state, α*. In this activated state, α* creates mRNA a at a much faster rate g*, capturing the essentials of a positive feedback mechanism. Rates are chosen to produce switching times that are representative of experiments5 while maintaining synthesis and degradation rates of both protein and mRNA in the realm of typical values.47 A Gillespie algorithm23 was used to generate stochastic trajectories of protein (A) levels that exhibit bistability. Next, we describe the MaxCal model and formalism used to analyze the generated synthetic data and infer underlying details of the circuit. MaxCal Model for M1. MaxCal predicts path distributions by maximizing the path entropy, or caliber, subject to known constraints enforced via Lagrange multipliers.26−31,48−54 For model M1 describing SGAA, we follow our previous work25 by using three minimal constraints of protein synthesis, degradation, and positive feedback (autoactivation). The first two constraints are imposed on the averages of two random variables, lα and lA, by two Lagrange multipliers, hα and hA. The variable lα counts the stochastic number of particles being synthesized in a time interval (Δt) and ranges between zero and a predefined maximum M. The random variable lA is the number of previously existing proteins that survived degradation within the time interval Δt. Hence, 0 ≤ lA ≤ NA, where NA is the number of proteins present at the beginning of the time interval. The constraint of autoactivation is imposed on the average of lαlA by the corresponding Lagrange multiplier KA, thus coupling the protein production and degradation variables. This is the minimal model needed to describe SGAA (see our previous work25 for a detailed explanation) and defines caliber as M

NA

C = −∑

M

∑ Pl ,l

α A

log Plα , lA + hα

lα = 0 lA = 0 M

+ hA

r (N ) =

NA

M α A

lα = 0 lA = 0

+ KA

F=

NA α A

∑ ∑ lαPl ,l

α A

NA

∑ ∑ lαlAPl ,l

M

α A

⎛ NA ⎞ ⎟⎟ exp(hαlα + hAlA + KAlαlA) l lA = 0 ⎝ A ⎠

(5)

where X represents any combination of the stochastic variables of interest used in the first part of the equation. We anticipate 0 < F < 1 for positive feedback. Estimating Parameters from Stochastic Trajectories in M1. To combine the MaxCal formalism with ML estimation, we now briefly describe how to compute the likelihood of an observed trajectory using the MaxCal derived path probability (eq 3). The details can be found in our previous work.25 We first discretize the path in small intervals of the sampling time Δt, yielding T + 1 frames. Let i and j be protein numbers at two subsequent frames (say t and t + 1). The transition probability P(j, t + 1; i, t) is then given by

(2)

P(j , t + 1; i , t ) = Pi → j =

i

∑ ∑ δ(lα + lA − j)Pl ,l

α A

lα = 0 lA = 0

(6)

where δ is the Dirac delta function and Plα,lA is described by eq 3. MaxCal has an upper limit on protein production in the interval Δt, unlike experimental systems or synthetic data where a rare fluctuation may give rise to more than M proteins being created in a frame. To appropriately address this, we calculate transition probabilities over multiple intervals (m frames) for a given set of MaxCal parameters using discrete-time FSP.55 The multistep transition probability is denoted as P(j, t + m; i, t) and abbreviated as P(i→j),m. With this definition, the likelihood of observing the given trajectory is calculated as

NA

∑ ∑ ⎜⎜ lα = 0

M

∑ ∑ ∑ XPeq(NA)Pl , l NA = 0 lα = 0 lA = 0

⎛ NA ⎞ Plα , lA = Q−1⎜⎜ ⎟⎟ exp(hαlα + hAlA + KAlαlA) ⎝ lA ⎠ M

(4)

N

(⟨lα 2⟩tot − ⟨lα⟩tot 2 )(⟨lA 2⟩tot − ⟨lA⟩tot 2 ) ∞

where Plα,lA is the probability of observing a particular combination of lα and lA during the time interval Δt. Maximizing the caliber yields the path probabilities as

Q=

∑ Peq(N )r(N )

⟨lαlA⟩tot − ⟨lα⟩tot ⟨lA⟩tot

⟨X ⟩tot =

NA

lα = 0 lA = 0

r≈

where ⟨···⟩i represents the average of a quantity of interest when NA = i, NL is the peak of the protein number distribution in the low state, NH is the peak in the high state, and Peq(N) is the probability of having N proteins within the system at relative equilibrium (calculated via discrete-time FSP55). Furthermore, we can extract an effective feedback metric F in terms of the Pearson correlation coefficient between lα and lA

lα = 0 lA = 0

∑ ∑ lAPl ,l

N − ⟨lA⟩N , N Δt

(3)

This path probability distribution for time Δt can be used to propagate probabilities in time and compute the likelihood of stochastic trajectories (experimental or synthetic) of arbitrary time duration. This is done via a discrete-time version of the Finite State Projection formalism of Munsky et al.55 to limit the phase space of protein number while maintaining a known error bound. The steady state protein number distribution can also be obtained with this technique by advancing the distribution in time to relative equilibrium (e.g., 100 times the average residence time in either state). The effective protein synthesis rate (activated p*eff and inactivated peff) and protein degradation rate (r) in SGAA can also be estimated from the path distribution generated above. Specifically,

5

3=

∏ P(Nt + m , t + m; Nt , t = m(n − 1)) n=1

=

∏ {i → j}

ω

(i → j), m P(i → j), m

(7)

where 5 is T/m rounded down to the nearest integer, Nt is the number of A proteins present in frame t, and ω(i→j), m is the total number of i → j transitions over m frames, as seen in the trace. Equation 7 is then maximized with respect to hα, hA, KA, and M, to determine their representative values for a given trace. We choose m to be the average residence time (in C

DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B

autofeedback by further constraining the average of lαlA and lβlB with the Lagrange multiplier KAα. The rationale for this additional term in the Hamiltonian is twofold: (i) Mutual repression implies autoactivation. We intend to infer model parameters for experimentally relevant values of sampling time Δt (on the order of minutes), and for such relatively large time intervals, one is likely to witness the impact of autoactivation. (ii) Perhaps more importantly, the four constraints of ⟨lαlB⟩, ⟨lβlA⟩, ⟨lαlA⟩, and ⟨lβlB⟩ now provide a complete description that includes all possible cross-talks between the birth and degradation variables up to the second order and avoids arbitrarily assigning KAα = 0. We intend to let the data select values of KAα within a model that is self-consistent up to the second order. This minimal model of repression and promotion together yield the caliber as

frames) in the high and low state, but our results are robust for the choice of m around this residence time.25 Generating Synthetic Data for TS. In the remaining models, we test MaxCal’s inferential power with a two-gene mutually repressing TS circuit. In a previous work,26 we showed that MaxCal can capture fluctuations in TS using synthetic data. However, we neglected the role of mRNA in the generation of synthetic data and ignored the impact of experimentally relevant sampling times (Δt) on our inference methods. Similar to M1, we now explicitly incorporate mRNA using the following reaction scheme32 g

α → α + a; p

a → a + A; f

α + A ⇄ α* ;

d

a→⌀ r

A→⌀ g

α* → α* + a ;

M

g*

α* → α* + b

b g

α → α + b; p

b → b + B; f

α + B ⇄ α′ ; b

NA

C = −∑

M

NB

∑ ∑ ∑ Pl ,l ,l ,l

α A β B

log Plα , lA , lβ , lB

lα = 0 lA = 0 lβ = 0 lB= 0

d

b→⌀

M

+ hα

r

B→⌀ g*

α′ → α′ + a ;

NA

M

NB

∑ ∑ ∑ ∑ (lα + lβ)Pl ,l ,l ,l

α A β B

lα = 0 lA = 0 lβ = 0 lB= 0 M

g

α′ → α′ + b

+ hA

(8)

NA

M

NB

∑ ∑ ∑ ∑ (lA + lB)Pl ,l ,l ,l

α A β B

lα = 0 lA = 0 lβ = 0 lB= 0

where generic mRNA’s a and b are created from the gene α (at different loci) at a rate of g and are degraded at a rate of d. These mRNAs produce their respective proteins A and B at a rate of p, while A and B degrade at a rate of r. To incorporate feedback, either protein can bind and unbind the promoter site at rates of f and b, respectively, sending the promoter into and out of different expression states (α* for A and α′ for B). In α*, the gene creates mRNA b at a much slower rate g*, while α′ creates a at the slower rate g*, capturing the essentials of mutual repression. Rates are again chosen to produce representative switching times5 while maintaining realistic synthesis and degradation rates.47 A Gillespie algorithm23 is then used to generate stochastic trajectories of protein levels that exhibit bistability. M2, M3, M4, and M5 use the same underlying circuit to generate the stochastic time traces. M2 and M3 (in fluorescence) provide time trajectories of both proteins A and B to MaxCal for inference. To further challenge MaxCal, M4 and M5 provide the noisy protein number (M4)/ fluorescence trajectory (M5) of only one protein to mimic a situation where only one of the proteins may be tagged. MaxCal Model for TS. For TS, two genes express two different proteins, and consequently, we extend our MaxCal model from M1 to two proteins. For the time interval Δt, we utilize two production/birth variables, lα and lβ, ranging from zero to M for proteins A and B, respectively. Likewise, we define lA and lB to denote the number of preexisting A and B molecules, respectively, that remain after the time interval Δt. Accordingly, 0 ≤ lA ≤ NA and 0 ≤ lB ≤ NB with NA and NB being the initial number of A and B proteins present, respectively. Lagrange multipliers hα, hβ, hA, and hB are used to constrain the average values of lα, lβ, lA, and lB. However, we assume symmetry in A and B by setting hα = hβ and hA = hB. Next, the constraint of mutual repression is imposed by coupling the production variable of A with the degradation variable of B and vice versa. Specifically, we constrain the average of lαlB and lβlA using the corresponding Lagrange multiplier KAβ, identical to earlier work on TS using MaxCal.26 However, in contrast to the previous model, we now add

M

+ KAα

NA

M

NB

∑ ∑ ∑ ∑ (lαlA + lβlB)Pl ,l ,l ,l

α A β B

lα = 0 lA = 0 lβ = 0 lB= 0 M

+ KAβ

NA

M

NB

∑ ∑ ∑ ∑ (lβlA + lαlB)Pl ,l ,l ,l

α A β B

lα = 0 lA = 0 lβ = 0 lB= 0

(9)

where Plα,lβ,lA,lB is the probability of obtaining a particular combination of lα, lβ, lA, and lB. Upon maximizing the caliber, the path probability Plα,lβ,lA,lB becomes ⎛ NA ⎞⎛ NB ⎞ Plα , lA , lβ , lB = Q−1⎜⎜ ⎟⎟⎜⎜ ⎟⎟ exp[hα(lα + lβ) + hA(lA + lB) ⎝ lA ⎠⎝ lB ⎠ + KAα(lαlA + lβlB) + KAβ(lβlA + lαlB)] M

Q=

NA

M

⎛ NA ⎞⎛ NB ⎞ ⎟⎟⎜⎜ ⎟⎟ exp[hα(lα + lβ) lB= 0 ⎝ lA ⎠⎝ lB ⎠ NB

∑ ∑ ∑ ∑ ⎜⎜ lα = 0 lA = 0 lβ = 0

+ hA(lA + lB) + KAα(lαlA + lβlB) + KAβ(lβlA + lαlB)] (10)

Similar to eq 4 for M1, we can approximate effective protein synthesis rates (unrepressed peff and repressed peff * ) and protein degradation rates (r) for the TS circuit in terms of MaxCal variables ⟨lβ⟩NL , NH ⎛ g ⎞ ⟨lα⟩NH , NL peff = p⟨a⟩NH , NL = p⟨b⟩NL , NH ≈ p⎜ ⎟ ≈ = ⎝d⎠ Δt Δt ⟨l ⟩ ⎛ g * ⎞ ⟨l ⟩ * = p⟨a⟩N , N = p⟨b⟩N , N ≈ p⎜ ⎟ ≈ α NL , NH = β NH , NL peff L H H L ⎝d ⎠ Δt Δt rA(NA , NB) = ∞

r≈

NA − ⟨lA⟩NA , NB NAΔt

,

rB(NA , NB) =



∑ ∑ NA = 0 NB= 0



Peq(NA , NB)rA(NA , NB) =

NB − ⟨lB⟩NA , NB NBΔt ∞

∑ ∑

Peq(NA , NB)rB(NA , NB)

NA = 0 NB= 0

(11) D

DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B where ⟨···⟩i,j represents the average of a quantity given that there are i and j numbers of A and B proteins initially present, respectively, NL and NH are the most likely number of proteins in the repressed and unrepressed state, respectively, and Peq(i, j) is the probability of having NA = i and NB = j at relative equilibrium (calculated via discrete-time FSP55). With two proteins present in the system, two complementary feedback metrics can be defined. Autofeedback, FAα, is defined as

where 5 is T/m rounded down to the nearest integer, ω(i→j),(k→l),m is the total number of simultaneous i → j and k → l transitions over m frames, and NA,t (NB,t) denotes the number of A (B) proteins at frame t. The likelihood is then maximized with respect to hα, hA, KAα, KAβ, and M to select representative values. As in M1 inference, m is chosen to be the number of frames equivalent to the average residence time between states. Estimating Parameters from Stochastic Trajectories in M4. When considering the likelihood of only one of the stochastic protein expression traces (say protein A), the two protein joint transition probabilities must be summed over all possible starting values of NB and all possible ending values of NB, producing a new transition probability of

⟨lαlA⟩tot − ⟨lα⟩tot ⟨lA⟩tot

FAα =

2

(⟨lα ⟩tot − ⟨lα⟩tot 2 )(⟨lA 2⟩tot − ⟨lA⟩tot 2 ) ⟨lβlB⟩tot − ⟨lβ⟩tot ⟨lB⟩tot

=

(⟨lβ 2⟩tot − ⟨lβ⟩tot 2 )(⟨lB2⟩tot − ⟨lB⟩tot 2 )



P2(j , t + 1; i , t ) = P2,(i → j) =

(12)

⟨X ⟩tot =

M



NA

M

α A β B

with X being any combination of the stochastic variables. Conversely, interfeedback, FAβ, is defined as ⟨lβlA⟩tot − ⟨lβ⟩tot ⟨lA⟩tot (⟨lβ 2⟩tot − ⟨lβ⟩tot 2 )(⟨lA 2⟩tot − ⟨lA⟩tot 2 )

=

M

k

∑ ∑ ∑ ∑ δ(lα + lA − j)δ(lβ + lB − l)Pl ,l ,l ,l

α A β B

lα = 0 lA = 0 lβ = 0 lB= 0

(15)

Similar to eq 7, we must account for rare fluctuations producing more than M proteins in a single frame, so we calculate transition probabilities over m frames as P(j, l, t + m; i, k, t), abbreviated as P(i→j),(k→l),m. This yields the likelihood function as 5

3=

∏ P(NA ,t + m , NB,t + m , t + m; NA ,t , NB,t , t = m(n − 1)) n=1

=

∏ {i → j , k → l}

ω

P2,((ii→→j),jm), m (18)

where ω(i→j),m has the same definition as it does in eq 7. Like all previous models, 3 is now a function of hα, hA, KAα, KAβ, and M, and can be maximized with respect to these parameters to determine the representative values. Creating and Analyzing Synthetic Fluorescence Trajectories for M3 and M5. The models described above rely on the input trajectory being presented in discrete protein numbers. However, gene expression is typically measured experimentally by fluorescently labeled reporter proteins. The amount of fluorescence measured per protein, defined as f, is not a fixed number but rather has a distribution of typical values. Consequently, the measured gene expression trajectory is a convolution of the protein number fluctuation intrinsic to the circuit topology and the fluorescence fluctuation inherent to the experiment. This is a fundamental challenge in data analysis. In our earlier work with SGAA,25 we established a novel way to address this issue by using a formalism called parallel fluorescence-to-number conversion, or simply PFNC. In PFNC, the likelihood function is constructed by considering the fluorescence per protein distribution along with the protein number transition probabilities. As a result, the likelihood function describes the probability of observing a particular f luorescence trajectory rather than a protein number trajectory and can be directly applied to experimental data. We now extend PFNC to the TS framework in M2 and M4 using fluorescence trajectories of both proteins in the case of M3 and only one of the proteins in M5. To test MaxCal’s inference capabilities in the case of such limited and convoluted information, we first created synthetic fluorescence trajectories by purposefully “corrupting” the same

= Pi → j , k → l i

∏ {i → j}

P(j , l , t + 1; i , k , t ) M

∏ P2(NA ,t + m , t + m; NA ,t , t = m(n − 1)) n=1

(14)

For the TS circuit, we expect 0 < FAα < 1 and −1 < FAβ < 0, but they are not restricted to these ranges. Estimating Parameters from Stochastic Trajectories in M2. To objectively select representative parameter values for a given stochastic TS trajectory (generated from eq 8 or obtained from experiment) using ML, we must first describe the likelihood of that trajectory in terms of these parameters. To start, we again discretize the trajectory into T + 1 frames with a sampling interval of Δt. With i, j as the number of A proteins at two subsequent frames (say t and t + 1) and k, l as the number of B proteins at the same two frames, the transition probability P(j, l, t+1; i, k, t) is then given by

=

(17)

5

3=

⟨lαlB⟩tot − ⟨lα⟩tot ⟨lB⟩tot (⟨lα 2⟩tot − ⟨lα⟩tot 2 )(⟨lB2⟩tot − ⟨lB⟩tot 2 )

NB

where Peq(NB|i) is the conditional probability of having NB number of B proteins given that NA = i. As before, we must use multiframe transitions to address rare fluctuations, so we calculate transition probabilities over m frames as P2(j, t + m; i, t), abbreviated as P2,(i→j),m, using discrete-time FSP.55 With this reduction, the likelihood of observing a particular expression trace, given that two proteins are present and interacting, can be calculated as

∑ ∑ ∑ ∑ ∑ ∑ XPeq(NA , NB)Pl ,l ,l ,l

(13)

=

M

× δ(lα + lA − j)Plα , lA , lβ , lB

NB

NA = 0 NB = 0 lα = 0 lA = 0 lβ = 0 lB= 0

FAβ =

i

NB = 0 lα = 0 lA = 0 lβ = 0 lB= 0

where ∞

M

∑ ∑ ∑ ∑ ∑ Peq(NB|i)

ω

(i → j),(k → l), m P(i → j),(k → l), m

(16) E

DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B

Figure 1. Predicted distributions agree well with the “true” distributions for M1. (A) Protein number probability distributions from synthetic input trajectories (blue) and predicted MaxCal trajectories (red). (B) Low state and (C) high state residence time probability distributions for synthetic input trajectories (blue) and predicted MaxCal trajectories (red). The synthetic input trajectory was generated using d = 0.2 s−1, p = 0.02 s−1, fd = 5.0 × 10−3 s−1, bd = 50.0 s−1, f p = 6.0 × 10−3 s−1, bp = 3.0 × 10−5 s−1, g = 0.05 s−1, g* = 0.5 s−1, and r = 1.0 × 10−3 s−1 in eq 1, and they are recorded with Δt = 300 s. Representative MaxCal parameters are hα = −0.546, hA = 0.674, KA = 0.031, and M = 14.

likelihood of observing the fluorescence trajectory of a single protein as

stochastic protein number trajectories used in M2 and M4. For simplicity, we assume the fluorescence per protein distribution is approximately Gaussian,56−58 with an average and standard deviation of f 0 and σ, respectively. Using this assumption, the fluorescence at each time point (having Nt proteins) is then randomly selected using a convolution of Nt distributions of fluorescence per protein, i.e., another Gaussian with an average and standard deviation of Nt f 0 and Nt σ , respectively. For illustrative purposes, we will demonstrate PFNC on an experimentally relevant noise level (defined as the ratio of σ/ f 0) of 30%.56−58 To analyze this synthetic data generated in arbitrary fluorescence, we now construct the corresponding likelihood function by combining protein number path probabilities with the fluorescence per protein distribution. We assume that this fluorescence per protein distribution is known via low-intensity photobleaching experiments.59−71 Using similar notation as eq 16, the likelihood of observing a given fluorescence trajectory is now calculated as 5

3=





∏ ⎜⎜ ∑ n=1

⎝ NA ,t = 0













Φ(NA , t |ft )

NA , t + m = 0

⎞ × Φ(NA , t + m|ft + m )⎟ ⎟ ⎠

(21)

As in the preceding models, 3 is then maximized with respect to hα, hA, KAα, KAβ, and M to determine representative values in both cases.



RESULTS AND DISCUSSION

MaxCal Accurately Infers Underlying Rate Parameters and Observables for M1. Analyzing stochastic protein expression trajectories generated using eq 1 and the rates reported in Figure 1, we inferred representative Lagrange multipliers by maximizing the likelihood described in eq 7. Similar to our previous work,25 we applied our MaxCal + ML formalism to 10 replicates of 100 trajectories with enough frames to equate to 7 days at a sampling rate of 5 min/frame (Δt = 300 s), closely matching experimentally realistic conditions.5 These replicates help to demonstrate the accuracy (comparing the average to the true) and uncertainty (from the standard deviation) of inferred rates. With these extracted parameters, we inferred effective rates using eq 4, and these inferred rates compare well with the “true” rates used in simulation (see Table 1). We also computed four different entropies25 (reported in Table 1) to quantitively measure MaxCal’s ability to capture the fluctuations of the input trajectory: (i) the total trajectory entropy (SI) capturing fluctuations at all scales, (ii) entropy in the high state only (Sh), (iii) entropy in the low state only (Sl), and (iv) coarsegrain entropy (Scg) capturing the stochastic transitioning between the high and low states. The overall path entropy SI is defined as

Φ(NA , t |fA , t )Φ(NB , t |fB , t )

(19)

where fA,t ( f B,t) is the amount of fluorescence produced by A (B) proteins at frame t and Φ(N|f) is the conditional probability that N proteins are present given that a fluorescence of f has been observed. These conditional probabilities are calculated via Bayes’ theorem as Pg(f |N )Peq(N ) Ptot(f )

⎝ NA ,t = 0





× P2(NA , t + m , t + m ; NA , t , t = m(n − 1))

× P(NA , t + m , NB , t + m , t + m; NA , t , NB , t , t = m(n − 1))

Φ(N |f ) =



∏ ⎜⎜ ∑ n=1

NA , t + m = 0 NB , t = 0 NB , t + m = 0

⎞ × Φ(NA , t + m|fA , t + m )Φ(NB , t + m|fB , t + m )⎟ ⎟ ⎠



5

3=

(20)

where Pg(f |N) is the Gaussian fluorescence distribution with an average and standard deviation of Nf 0 and N σ as mentioned above, Peq(N) is the protein number distribution at relative equilibrium calculated via discrete-time FSP,55 and Ptot(f) is the fluorescence probability distribution over the entire trace. Using a similar approach, eq 18 can be extended to describe the

SI = −∑ PP i i → j log 2(PP i i → j) i,j

F

(22) DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B Table 1. Comparison of True and Predicted Rates and Metrics Using MaxCal in M1 peff (s−1) peff * (s−1) r (s−1) τL→H (s) τH→L (s) SI (bits) Sh (bits) Sl (bits) Scg (bits) F

true values

predicted values

5.0 × 10−3 50.0 × 10−3 1.0 × 10−3 73.4 × 103 82.7 × 103 9.0 9.5 6.7 1.03

(5.6 ± 0.2) × 10−3 (42.2 ± 1.8) × 10−3 (0.93 ± 0.05) × 10−3 (95.8 ± 83.1) × 103 (101.9 ± 88.3) × 103 9.1 ± 0.0 9.0 ± 0.0 7.7 ± 0.1 1.02 ± 0.00 0.132 ± 0.004

Table 2. Comparison of True and Predicted Rates and Metrics Using MaxCal in M2 peff (s−1) peff * (s−1) r (s−1) τ (s) SI (bits) Ss (bits) Scg (bits) FAα FAβ

true values

predicted values

20.0 × 10−3 0.1 × 10−3 1.0 × 10−3 110.4 × 103 11.1 10.2 1.02

(13.9 ± 0.9) × 10−3 (0.15 ± 0.02) × 10−3 (0.74 ± 0.05) × 10−3 (114.8 ± 96.5) × 103 11.4 ± 0.1 10.5 ± 0.2 1.02 ± 0.00 0.086 ± 0.016 −0.396 ± 0.017

important to note that KAα is nonzero, suggesting a nontrivial role of this constraint in the inference process. Mutual repression should produce net self-promotion. This is reflected in the two feedback terms: FAβ shows an appreciable negative correlation between the presence of one protein and the production of the other, while FAα yields a positive value (see Table 2). As in M1, these feedback values cannot be computed using raw trajectory data due to the inability to track the production variable. Comparison of different entropies further highlights MaxCal’s ability to well approximate the inherent fluctuations in the synthetic data. Due to the presence of another protein, the calculation of the overall path entropy is slightly modified from eq 22 to include the other species

where Pi is the probability of having i proteins in the system and Pi→j is the probability of transitioning from i to j proteins in a single frame. Sh and Sl are defined in the same way but only considering parts of a trajectory that are in the high or low state, respectively. For Scg, the expression trajectory is simplified into a binary trajectory between the low state (Ncg = 0) and the high state (Ncg = 1) and Scg is then calculated in the same fashion as eq 22 (see our previous work25 for more detail). We also compute the protein number distribution as well as the residence time distributions and averages of the high and low states (τH→L and τL→H, respectively) using discrete-time FSP55 and compare them between the “true” and MaxCal inferred models (see Table 1 and Figure 1). Furthermore, we report the extracted feedback parameter (Table 1) only for MaxCal. Note, this parameter cannot be evaluated for the input trajectory due to our inability to measure the production variable separately from the raw data. Figure 1 shows that MaxCal predicted distributions agree well with the ones from the synthetic data. These comparisons extend our previous work25 and conclude that the MaxCal model (eq 2) is capable of describing SGAA even when underlying details of mRNA expression are explicitly included to generate the synthetic data. MaxCal Accurately Infers Underlying Rates and Observables for TS Using Both Protein Trajectories (M2). To further test MaxCal’s ability to model and infer underlying details of more complex circuits, we analyze the TS circuit (modeled using eq 8) using the stochastic trajectories of both proteins, A and B. It is important to note that the present work differs from the previous work26 in four aspects: (i) the earlier work did not use synthetic trajectories from a reaction network with mRNA being explicitly present; (ii) the sampling time (Δt) was much smaller than typical experimental values; (iii) for simplification, it was assumed that KAα = 0 in eq 9; and (iv) the inference was based on moments of observables rather than the likelihood of the entire trajectory, causing data reduction. The caliber is now constructed using the most general set of constraints, and the likelihood function (eq 16) allows us to directly infer from trajectories, fully utilizing the entire data content. As before, the inference method was applied to 10 replicates of 100 trajectories with enough frames to equate to 7 days at a sampling rate of 5 min/frame (Δt = 300 s). Using the extracted parameters from each replicate, we computed the average and standard deviation of the effective rates (using eq 11) and feedback parameters (using eqs 12 and 14). Comparison of the effective rates and feedbacks to the “true” values from the stochastic input data (Table 2) demonstrates that MaxCal accurately and consistently infers underlying details even for realistic sampling times. It is also

SI = − ∑ Pi , kP(i → j),(k → l) log 2(Pi , kP(i → j),(k → l)) i ,j,k ,l

(23)

where Pi,k is the probability of having i number of A proteins (i.e., NA = i) and k number of B proteins (i.e., NB = k), while P(i→j),(k→l) is the probability of simultaneously transitioning NA from i to j and NB from k to l in a single frame. Since the two stable states in TS are both symmetric, we only consider a single “state entropy”, Ss. For each state, we calculate the path entropy only considering parts of the trajectory that are in that state and Ss is reported as the average of those values. Scg is still calculated in the same binary fashion as before. Again, due to symmetry between the states, we compute the average of the dwell time in either state as τ. As seen in Table 2, the MaxCal inferred entropy values and dwell time values are in good agreement with the input. Next, we compared the protein number distributions of A and B between the “true” synthetic data and the MaxCal generated model. Figure 2 demonstrates that the two distributions are in reasonable agreement. We further tested against additional synthetic data sets generated using different parameter values, and all metrics are again in good agreement, ensuring the robustness of the proposed methodology. MaxCal Accurately Infers Underlying Rates and Observables for TS Using Fluorescence Trajectories of Both Proteins (M3). Next, we consider typical conditions encountered experimentally when data is not available in protein number but rather in noisy fluorescence. This is particularly challenging because fluorescence per protein is a random variable itself. Using the synthetically generated fluorescence trajectories of both proteins, the inference method was again applied to 10 replicates of 100 trajectories with enough frames to equate to 7 days at a sampling rate of 5 min/ frame (Δt = 300 s). Using the likelihood function that incorporates this fluorescence fluctuation (eq 19), we find that G

DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B

Figure 2. Predicted distributions agree well with the “true” distributions for M2. Protein number probability distributions (dark color for high probability) from (A) synthetic input trajectories and (B) predicted MaxCal trajectories. (C) Residence time probability distributions for synthetic input trajectories (blue) and predicted MaxCal trajectories (red). Synthetic input trajectory data was generated using d = 0.5 s−1, p = 0.02 s−1, f = 3.5 × 10−6 s−1, b = 2.0 × 10−5 s−1, g = 0.5 s−1, g* = 2.5 × 10−3 s−1, and r = 1.0 × 10−3 s−1 in eq 8 with sampling time Δt = 300 s. Representative MaxCal parameters are hα = 0.259, hA = 1.526, KAα = −0.034, KAβ = −0.244, and M = 31.

distributions of both the MaxCal inference and the “true” synthetic data, demonstrating a satisfactory overlap. MaxCal Accurately Infers Underlying Rates and Observables for TS Using Only One Protein Trajectory (M4). Next, we use the same TS circuit described in eq 8 but only provide the trajectory of one protein to MaxCal’s inference machinery. Thus, in M4, MaxCal is provided with information about only one species out of a total of seven species, providing a stronger test for MaxCal’s inferential power. While the same caliber function persists (eq 9), a new likelihood function accounting for this data reduction (eq 18) was used to determine the Lagrange multipliers. Again, using 10 replicates of 100 trajectories over 7 days with Δt = 300 s, effective rates and entropy values (SI, Ss, Scg) compare well to their “true” values from synthetic data, as seen in Table 4. The standard deviations in these quantities increase once again (compared to M2) but remain within an acceptable range. We notice that the magnitude of the autofeedback FAα and the mutual feedback FAβ is higher compared to M2. This can be attributed to the lack of knowledge about the second trajectory. The combined knowledge about the first and second trajectories encodes direct information about the cross-talk between A and B by

MaxCal inferred rates, entropies, and dwell times are again in good agreement with their “true” values and standard deviations increase slightly but remain moderate (see Table 3). Feedback values reported in Table 3 also compare well with Table 3. Comparison of True and Predicted Rates and Metrics Using MaxCal in M3 peff (s−1) p*eff (s−1) r (s−1) τ (s) SI (bits) Ss (bits) Scg (bits) FAα FAβ

true values

predicted values

20.0 × 10−3 0.1 × 10−3 1.0 × 10−3 110.4 × 103 11.1 10.2 1.02

(14.7 ± 1.4) × 10−3 (0.16 ± 0.03) × 10−3 (0.77 ± 0.06) × 10−3 (109.6 ± 95.7) × 103 11.3 ± 0.2 10.4 ± 0.2 1.02 ± 0.00 0.105 ± 0.011 −0.421 ± 0.014

the feedback inferred in M2. Figure 3 provides a further comparison by plotting the protein number and dwell time

Figure 3. Predicted distributions agree well with the “true” distributions for M3. Protein number probability distributions (dark color for high probability) from (A) synthetic input trajectories and (B) predicted MaxCal trajectories. (C) Residence time probability distributions for synthetic input trajectories (blue) and predicted MaxCal trajectories (red). The synthetic input trajectory is the same as that in M2 with f 0 = 100 and σ = 30 used to mimic realistic fluorescence fluctuation. Representative MaxCal parameters are hα = 0.311, hA = 1.493, KAα = −0.037, KAβ = −0.248, and M = 27. H

DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B Table 4. Comparison of True and Predicted Rates and Metrics Using MaxCal in M4 peff (s−1) peff * (s−1) r (s−1) τ (s) SI (bits) Ss (bits) Scg (bits) FAα FAβ

true values

predicted values

20.0 × 10−3 0.1 × 10−3 1.0 × 10−3 110.4 × 103 11.1 10.2 1.02

(22.7 ± 3.1) × 10−3 (0.22 ± 0.06) × 10−3 (1.15 ± 0.17) × 10−3 (108.2 ± 97.0) × 103 10.7 ± 0.2 9.7 ± 0.2 1.02 ± 0.00 0.439 ± 0.137 −0.622 ± 0.083

Table 5. Comparison of True and Predicted Rates and Metrics Using MaxCal in M5 peff (s−1) peff * (s−1) r (s−1) τ (s) SI (bits) Ss (bits) Scg (bits) FAα FAβ

true values

predicted values

20.0 × 10−3 0.1 × 10−3 1.0 × 10−3 110.4 × 103 11.1 10.2 1.02

(25.8 ± 2.1) × 10−3 (0.31 ± 0.07) × 10−3 (1.37 ± 0.12) × 10−3 (121.1 ± 106.0) × 103 10.3 ± 0.2 9.5 ± 0.2 1.02 ± 0.00 0.710 ± 0.120 −0.781 ± 0.067

fluorescence trajectory, MaxCal is still able to capture fluctuations of the system and infer network details to reasonable accuracy. However, we notice that M4 and M5 do not capture trajectory fluctuations as well as M2 and M3 (see entropy metrics in Tables 2, 3, 4, and 5). This is either due to loss of information (in M4 and M5) and/or convolution with fluorescence (in M5). These issues may become magnified when analyzing data for circuits that have moderate repression (low FAβ) but strong autoactivation (high FAα). With these conditions, M4 and M5 may yield KAβ = 0 completely ignoring the cross-talk between A and B due to a lack of information on the other trajectory. On the basis of this, it is always advisible to tag both proteins if their cross-talk is known a priori.

providing the joint distribution of both proteins. In M4, we reduce the data given to our inference machinery and consequently erroneously estimate the direct cross-correlation captured in M2. The protein number distributions of the MaxCal inferred model also agree reasonably well compared to synthetic data (see Figure 4). These comparisons show that MaxCal can perform well even when the sampling time is sufficiently large and the input data is reduced to one protein. MaxCal Accurately Infers Underlying Rates and Observables for TS Using Only One Fluorescence Trajectory (M5). As in M3, we consider the experimental situation when data is not available in protein number but rather in noisy fluorescence. We, however, use the synthetically generated fluorescence trajectories of only one of the protein species (say protein A) to infer underlying details of the model. We provide 10 replicates of 100 trajectories over 7 days with a sampling time of Δt = 300 s. Maximizing the likelihood function that incorporates this fluorescence fluctuation (eq 21), we find that MaxCal inferred rates, entropies, and dwell times are again in good agreement with their “true” values with reasonable standard deviations (see Table 5). The magnitude of feedback strengths continues to increase higher than M2. We expect this is most likely for the combined effect of data reduction (similar to M4) and convolution with fluorescence fluctuation. Figure 5 further demonstrates protein number and dwell time distributions inferred using MaxCal agree reasonably well with that of the “true” synthetic data. This exercise confirms that, even for a two-gene mutually repressing circuit with available data on only one protein given in a noisy



CONCLUSION In summary, we have substantially extended the scope of MaxCal to analyze gene networks beyond previous efforts,25,26 focusing on switch-like circuits arising from effective positive feedback either by single gene autoactivation or two-gene mutual repression. We first demonstrated MaxCal’s ability to infer underlying details using the synthetic data generated from a single gene autoactivating circuit involving five different species (including mRNA dynamics). Next, we applied MaxCal to synthetic data generated from a two-gene mutually repressing TS circuit with mRNA explicitly modeled. For TS, we considered four different scenarios (M2, M3, M4, M5) with increasing levels of difficulty for MaxCal’s inference. First, in M2, we provided information about both proteins to MaxCal. Next, in M3, we provided the same information but in terms of

Figure 4. Predicted distributions agree well with the “true” distributions for M4. Protein number probability distributions (dark color for high probability) from (A) synthetic input trajectories and (B) predicted MaxCal trajectories. (C) Residence time probability distributions for synthetic input trajectories (blue) and predicted MaxCal trajectories (red). The synthetic input trajectories are the same as those used in M2. Representative MaxCal parameters are hα = 0.422, hA = 0.992, KAα = −0.034, KAβ = −0.267, and M = 16. I

DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B

Figure 5. Predicted distributions agree well with the “true” distributions for M5. Protein number probability distributions (dark color for high probability) from (A) synthetic input trajectories and (B) predicted MaxCal trajectories. (C) Residence time probability distributions for synthetic input trajectories (blue) and predicted MaxCal trajectories (red). Synthetic input trajectories are the same as those used in M2 with f 0 = 100 and σ = 30 used to mimic realistic fluorescence fluctuation. Representative MaxCal parameters are hα = 0.438, hA = 0.657, KAα = −0.020, KAβ = −0.264, and M = 11.

fluorescence instead of protein number to address a typical challenge with data from experiment. To test MaxCal’s inferential power further, we feigned ignorance to the presence of one protein and only provided the fluctuating time trace of the other protein. This model, M4, further extends MaxCal’s applicability to increased numbers of hidden species. Finally, in M5, we provide the same fluctuating time trace of one of the proteins (similar to M4), but in fluorescence instead of protein number. In all four cases, MaxCal is able to describe the underlying fluctuations as quantified by different entropies, average dwell times, and distributions of protein numbers compared against the known model used to generate the synthetic data. Furthermore, MaxCal yields several effective parameters that are not directly visible in experiments, such as the effective protein production and degradation rates in both the basal and repressed states. These inferred rates are in reasonable agreement with the “true” values. Interestingly, MaxCal can also provide effective feedback parameters (for promotion or repression between all species) from the fluctuations of the data. These parameters add an important set of metrics that further quantify these circuits. This can be particularly useful when analyzing the same circuit topology with varying conditions so as to alter the feedback strength in the system or even eliminate feedback altogether25 (i.e., bimodality to unimodality). Under stress, genomes of microbes and cancer cells may evolve circuits with varying degrees of feedback.72 Another interesting application of MaxCal would be to explore its ability to select between two different circuit topologies that give rise to the same switch-like behavior, e.g., SGAA (only positive feedback) and TS (mutually repressing two-gene circuit). These examples highlight possible applications of MaxCal to broader problems in synthetic biology beyond the specific model systems investigated here.





ACKNOWLEDGMENTS



REFERENCES

We acknowledge support from NSF (award number 1149992) and RCSA (as a Cottrell Scholar). We thank Ken Dill and Rob Phillips for introducing us to the concept and application of Maximum Caliber in biology. We appreciate many stimulating discussions with Gabor Balázsi, Steve Pressé, and Brian Munsky on many aspects of MaxCal, FSP, and stochastic data analysis. We acknowledge the High Performance Computing (HPC) facility at the University of Denver and Ben Fotovich for computing assistance.

(1) Gardner, T.; Cantor, C.; Collins, J. Construction of a Genetic Toggle Switch in Escherichia coli. Nature 2000, 403, 339−342. (2) Elowitz, M.; Leibler, S. A Synthetic Oscillatory Network of Transcriptional Regulators. Nature 2000, 403, 335−338. (3) Alon, U. Network Motifs: Theory and Experimental Approaches. Nat. Rev. Genet. 2007, 8, 450−461. (4) Tsai, T.; Choi, Y.; Ma, W.; Pomerening, J.; Tang, C.; Ferrell, J. Robust, Tunable Biological Oscillations from Interlinked Positive and Negative Feedback Loops. Science 2008, 321, 126−129. (5) Nevozhay, D.; Adams, R.; Itallie, E. V.; Bennett, M.; Balázsi, G. Mapping the Environmental Fitness Landscape of a Synthetic Gene Circuit. PLoS Comput. Biol. 2012, 8, e1002480. (6) Nevozhay, D.; Adams, R.; Murphy, K.; Josic, K.; Balázsi, G. Negative Autoregulation Linearizes the Dose-Response and Suppressed the Heterogeneity of Gene Expression. Proc. Natl. Acad. Sci. U. S. A. 2009, 106, 5123−5128. (7) Ozbudak, E.; Thattai, M.; Kurtser, I.; Grossman, A.; van Oudenaarden, A. Regulation of Noise in the Expression of a Single Gene. Nat. Genet. 2002, 31, 69−73. (8) Kaern, M.; Elston, T.; Blake, W.; Collins, J. Stochasticity in Gene Expression: From Theories to Phenotypes. Nat. Rev. Genet. 2005, 6, 451−464. (9) Paulsson, J. Summing Up the Noise in Gene Networks. Nature 2004, 427, 415−418. (10) Samoilov, M.; Plyasunov, S.; Arkin, A. Stochastic Amplification and Signaling in Enzymatic Futile Cycles through Noise-Induced Bistability with Oscillations. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 2310−2315. (11) Sanchez, A.; Kondev, J. Transcriptional Control of Noise in Gene Expression. Proc. Natl. Acad. Sci. U. S. A. 2008, 105, 5081−5086. (12) Shahrezaei, V.; Swain, P. The Stochastic Nature of Biochemical Networks. Curr. Opin. Biotechnol. 2008, 19, 369−374.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Kingshuk Ghosh: 0000-0003-4976-0986 Notes

The authors declare no competing financial interest. J

DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B

(37) Wu, F.; Wang, X. Applications of Synthetic Gene Networks. Sci. Prog. 2015, 98, 244−252. (38) Keller, A. Model Genetic Circuits Encoding Autoregulatory Transcription Factors. J. Theor. Biol. 1995, 172, 169−185. (39) Smolen, P.; Baxter, D.; Byrne, J. Frequency, Selectivity, Multistability, and Oscillations Emerge from Models of Genetic Regulatory Systems. Am. J. Physiol. 1998, 274, C531−C542. (40) Becskei, A.; Seraphin, B.; Serrano, L. Positive Feedback in Eukaryotic Gene Networks: Cell Differentiation by Graded to Binary Response Conversion. EMBO J. 2001, 20, 2528−2535. (41) Tyson, J.; Chen, K.; Novak, B. Sniffers, Buzzers, Toggles and Blinkers: Dynamics of Regulatory and Signaling Pathways in the Cell. Curr. Opin. Cell Biol. 2003, 15, 221−231. (42) Cheng, Z.; Liu, F.; Zhang, X.; Wang, W. Robustness Analysis of Cellular Memory in an Autoactivating Positive Feedback System. FEBS Lett. 2008, 582, 3776−3782. (43) Bishop, L.; Qian, H. Stochastic Bistability and Bifurcation in a Mesoscopic Signaling System with Autocatalytic Kinase . Biophys. J. 2010, 98, 1−11. (44) Frigola, D.; Casanellas, L.; Sancho, J.; Ibanes, M. Asymmetric Stochastic Switching Driven by Intrinsic Molecular Noise. PLoS One 2012, 7, e31407. (45) Faucon, P.; Pardee, K.; Kumar, R.; Li, H.; Loh, Y.-H.; Wang, X. Gene Networks of Fully Connected Triads with Complete AutoActivation Enable Multistability and Stepwise Stochastic Transitions. PLoS One 2014, 9, e102873. (46) Kepler, T.; Elston, T. Stochasticity in Transcriptional Regulation: Origins, Consequences, and Mathematical Representations. Biophys. J. 2001, 81, 3116−3136. (47) Phillips, R.; Kondev, J.; Theriot, J. Physical Biology of the Cell; Garland Science: New York, 2009. (48) Ghosh, K.; Dill, K.; Inamdar, M.; Seitaridou, E.; Phillips, R. Teaching the Principles of Statistical Dynamics. Am. J. Phys. 2006, 74, 123−133. (49) Seitaridou, E.; Inamdar, M.; Phillips, R.; Ghosh, K.; Dill, K. Measuring Flux Distributions for Diffusion in the Small-Numbers Limit. J. Phys. Chem. B 2007, 111, 2288−2292. (50) Wu, D.; Ghosh, K.; Inamdar, M.; Lee, H.; Fraser, S.; Dill, K.; Phillips, R. Trajectory Approach to Two-State Kinetics of Single Particles on Sculpted Energy Landscapes. Phys. Rev. Lett. 2009, 103, 050603. (51) Otten, M.; Stock, G. Maximum Caliber Inference of Nonequilibrium Processes. J. Chem. Phys. 2010, 133, 034119. (52) Pressé, S.; Ghosh, K.; Phillips, R.; Dill, K. Dynamical Fluctuations in Biochemical Reactions and Cycles. Phys. Rev. E 2010, 82, 031905. (53) Ghosh, K. Stochastic Dynamics of Complexation Reaction in the Limit of Small Numbers. J. Chem. Phys. 2011, 134, 195101. (54) Dixit, P.; Wagoner, J.; Weistuch, C.; Pressé, S.; Ghosh, K.; Dill, K. Perspective: Maximum Caliber is a General Variational Principle for Dynamical Systems. J. Chem. Phys. 2018, 148, 010901. (55) Munsky, B.; Khammash, M. The Finite State Projection Algorithm for the Solution of the Chemical Master Equation. J. Chem. Phys. 2006, 124, 044104. (56) Lawrimore, J.; Bloom, K.; Salmon, E. Point Centromeres Contain More than a Single Centromere-Specific Cse4 (CENP-A) Nucleosome. J. Cell Biol. 2011, 195, 573−582. (57) Taniguchi, Y.; Choi, P.; Li, G.; Chen, H.; Babu, M.; Hearn, J.; Emili, A.; Sunney, X. Quantifying E-coli Proteome and Transcriptome with Single-Molecule Sensitivity in Single Cells. Science 2010, 329, 533−538. (58) Tsekouras, K.; Custer, T.; Jashnsaz, H.; Walter, N.; Pressé, S. A Novel Method to Accurately Locate and Count Large Numbers of Steps by Photobleaching. Mol. Biol. Cell 2016, 27, 3601−3615. (59) Coffman, V.; Wu, J. Counting Protein Molecules Using Quantitative Fluorescence Microscopy. Trends Biochem. Sci. 2012, 37, 499−506.

(13) Elowitz, M.; Levine, A.; Siggia, E.; Swain, P. Stochastic Gene Expression in a Single Cell. Science 2002, 297, 1183−1186. (14) Tao, Y. Intrinsic and External Noise in an Auto-Regulatory Genetic Network. J. Theor. Biol. 2004, 229, 147−156. (15) Beard, D.; Qian, H. Chemical Biophysics: Quantitative Analysis of Cellular Systems; University Press: Cambridge, U.K., 2008. (16) Munsky, B.; Trinh, B.; Khammash, M. Listening to the Noise: Random Fluctuations Reveal Gene Network Parameters. Mol. Syst. Biol. 2009, 5, 318. (17) Lillacci, G.; Khammash, M. Parameter Estimation and Model Selection in Computational Biology. PLoS Comput. Biol. 2010, 6, e1000696. (18) Zechner, C.; Ruess, J.; Krenn, R.; Pelet, S.; Peter, M.; Lagers, J.; Koeppl, H. Moment-Based Inference Predicts Bimodality in Transient Gene Expression. Proc. Natl. Acad. Sci. U. S. A. 2012, 109, 8340−8345. (19) Lillacci, G.; Khammash, M. A Distribution-Matching Method for Parameter Estimation and Model Selection in Computational Biology. Int. J. Robust and Nonlinear Control 2012, 22, 1065−1081. (20) Ruess, J.; Milias-Argeitis, A.; Lygeros, J. Designing Experiments to Understand the Variability in Biochemical Reaction Networks. J. R. Soc., Interface 2013, 10, 20130588. (21) Lillacci, G.; Khammash, M. The Signal within the Noise: Efficient Inference of Stochastic Gene Regulation Models Using Fluorescence Histograms and Stochastic Simulations. Bioinformatics 2013, 29, 2311−2319. (22) Dill, K.; Bromberg, S. Molecular Driving Forces: Statistical Thermodynamics in Chemistry and Biology; Garland Science: New York, 2003. (23) Gillespie, D. Exact Stochastic Simulation of Coupled Chemical Reactions. J. Phys. Chem. 1977, 81, 2340−2361. (24) Kauffman, S. A Proposal for Using the Ensemble Approach to Understand Genetic Regulatory Networks. J. Theor. Biol. 2004, 230, 581−590. (25) Firman, T.; Balazsi, G.; Ghosh, K. Building Predictive Models of Genetic Circuits Using the Principle of Maximum Caliber. Biophys. J. 2017, 113, 2121−2130. (26) Pressé, S.; Ghosh, K.; Dill, K. Modeling Stochastic Dynamics in Biochemical Systems with Feedback Using Maximum Caliber. J. Phys. Chem. B 2011, 115, 6202−6212. (27) Pressé, S.; Ghosh, K.; Lee, J.; Dill, K. Principle of Maximum Entropy and Maximum Caliber in Statistical Physics. Rev. Mod. Phys. 2013, 85, 1115−1141. (28) Pressé, S.; Peterson, J.; Lee, J.; Elms, P.; MacCallum, J.; Marqusee, S.; Bustamante, C.; Dill, K. Single Molecule Conformational Memory Extraction: P5ab RNA Hairpin. J. Phys. Chem. B 2014, 118, 6597−6603. (29) Dixit, P.; Dill, K. Inferring Microscopic Kinetic Rates from Stationary State Distributions. J. Chem. Theory Comput. 2014, 10, 3002−3005. (30) Dixit, P.; Jain, A.; Stock, G.; Dill, K. Inferring Transition Rates of Networks from Populations in Continuous-Time Markov Processes. J. Chem. Theory Comput. 2015, 11, 5464−5472. (31) Wan, H.; Zhou, G.; Voelz, V. A Maximum-Caliber Approach to Predicting Perturbed Folding Kinetics Due to Mutations. J. Chem. Theory Comput. 2016, 12, 5768−5776. (32) Lipshtat, A.; Loinger, A.; Balaban, N.; Biham, O. Genetic Toggle Switch without Cooperative Binding. Phys. Rev. Lett. 2006, 96, 188101. (33) Murphy, K.; Balazsi, G.; Collins, J. Combinatorial Promoter Design for Engineering Noisy Gene Expression. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 12726−12731. (34) Lyons, S.; Xu, W.; Medford, J.; Prasad, A. Loads Bias Genetic and Signaling Switches in Synthetic and Natural Systems. PLoS Comput. Biol. 2014, 10, e1003533. (35) Wang, L.; Wu, F.; Flores, K.; Lai, Y.; Wang, X. Build to Understand: Synthetic Approaches to Biology. Integr. Biol. 2016, 8, 394−408. (36) Mukherji, S.; van Oudenaarden, A. Synthetic Biology: Understanding Biological Design from Synthetic Circuits. Nat. Rev. Genet. 2009, 10, 859−871. K

DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B (60) Coffman, V.; Wu, P.; Parthun, M.; Wu, J. CENP-A Exceeds Microtubule Attachment Sites in Centromere Clusters of Both Budding and Fission Yeast. J. Cell Biol. 2011, 195, 563−572. (61) Engel, B.; Ludington, W.; Marshall, W. Intraflagellar Transport Particle Size Scales Inversely with Flagellar Length: Revisiting the Balance-Point Length Control Model. J. Cell Biol. 2009, 187, 81−89. (62) Leake, M.; Chandler, J.; Wadhams, G.; Bai, F.; Berry, R.; Armitage, J. Stoichiometry and Turnover in Single, Functioning Membrane Protein Complexes. Nature 2006, 443, 355−358. (63) Ulbrich, M.; Isacoff, E. Subunit Counting in Membrane-Bound Proteins. Nat. Methods 2007, 4, 319−321. (64) Das, S.; Darshi, M.; Cheley, S.; Wallace, M.; Bayley, H. Membrane Protein Stoichiometry Determined from the Step-Wise Photobleaching of Dye-Labelled Subunits. ChemBioChem 2007, 8, 994−999. (65) Shu, D.; Zhang, H.; Jin, J.; Guo, P. Counting of Six pRNAs of Phi29 DNA-packaging Motor with Customized Single-Molecule DualView System. EMBO J. 2007, 26, 527−537. (66) Delalez, N.; Wadhams, G.; Rosser, G.; Xue, Q.; Brown, M.; Dobbie, I.; Berry, R.; Leake, M.; Armitage, J. Signal-Dependent Turnover of the Bacterial Flagellar Switch Protein FliM. Proc. Natl. Acad. Sci. U. S. A. 2010, 107, 11347−11351. (67) Demuro, A.; Penna, A.; Safrina, O.; Yeromin, A.; Amcheslavskey, A.; Cahalan, M.; Parker, I. Subunit Stoichiometry of Human Orai1 and Orai3 Channels in Closed and Open States. Proc. Natl. Acad. Sci. U. S. A. 2011, 108, 17832−17837. (68) Hastie, P.; Ulbrich, M.; Wang, H.; Arant, R.; Lau, A.; Zhang, Z.; Isacoff, E.; Chen, L. AMPA Receptor/TARP Stoichiometry Visualized by Single-Molecule Subunit Counting. Proc. Natl. Acad. Sci. U. S. A. 2013, 110, 5163−5168. (69) Arumugam, S.; Lee, T.; Benkovic, S. Investigation of Stoichiometry of T4 Bacteriophage Helicase Loader Protein (gp59). J. Biol. Chem. 2009, 284, 29283−29289. (70) Pitchiaya, S.; Androsavich, J.; Walter, N. Intracellular Single Molecule Microscopy Reveals Two Kinetically Distinct Pathways for MicroRNA Assembly. EMBO Rep. 2012, 13, 709−715. (71) Pitchiaya, S.; Krishnan, V.; Custer, T.; Walter, N. Dissecting Non-Coding RNA Mechanisms In Cellulo by Single-Molecule HighResolution Localization and Counting. Methods 2013, 63, 188−199. (72) González, C.; Ray, J.; Manhart, M.; Adams, R.; Nevozhay, D.; Morozov, A.; Balázsi, G. Stress-Response Balance Drives the Evolution of a Network Module and its Host Genome. Mol. Syst. Biol. 2015, 11, 827.

L

DOI: 10.1021/acs.jpcb.7b12251 J. Phys. Chem. B XXXX, XXX, XXX−XXX