Estimation and Inference of Diffusion Coefficients in Complex

Jan 7, 2011 - Hypothesis testing machinery, with adequate power in small samples, ..... In terms of statistical mechanics, this translates into the st...
0 downloads 0 Views 62KB Size
Supporting Information for Estimation and Inference of Diffusion Coefficients in Complex Biomolecular Environments Christopher P. Calderon January 5, 2011

1 The Ornstein-Uhlenbeck Model The transition density, p(zt+∆t |zt , θ) associated with this model can be written in closed form, it takes the form of a Gaussian distribution characterized by mean α(1−exp(−κ∆t))+ ˜ zt exp(−κ∆t) and variance Dκ (1−exp(−2κ∆t)). Furthermore, for evenly spaced data, one can compute the Maximum Likelihood Estimate (MLE) of the parameter without the need of a numerical optimization procedure. 1 MATLAB scripts demonstrating how to apply these results to time series data are available in Ref. 2

2 Goodness-of-Fit Tests The transition density is used for computing the time series needed for the M and Q test statistics (the information needed to compute the Ornstein Uhlenbeck transition density is provided above). To accomplish this, one uses the observed data zt+∆t and the assumed zR τ ∆t ˆ transition density to compute Zˆ ∆t := p(x|z(τ −1)∆t , θ)dx for an estimated (or assumed) τ

−∞

ˆ For simplicity the ∆t superscript will be omitted and a discrete time series index, τ , θ. will be used in what follows. The computed Zˆτ ’s are then used to compute the quantities derived in Ref., 3 namely:

M(m, l) ≡

−1 N X

w 2 (j/p)(N − j)ˆ ρ2ml (j) −

j=1

N −1 X j=1

−2   N  12 X w 2 (j/p) / 2 w 4 (j/p)

(1)

j=1

and  1  ˆ ˆ − hA0 /V02 Q(j) ≡ h(N − j)I(j) 1

(2)

Z1 Z1  2 ˆ ≡ I(j) gˆj (Z1 , Z2 ) − 1 dZ1 dZ2 . 0

0

N X 1 Kh (Z1 , Zˆτ )Kh (Z2 , Zˆτ −j ) gˆj (Z1 , Z2 ) ≡ N − j τ =j+1

A full description of the terms appearing above is given in Ref. 3 where they are derived, but the essence of the above two test statistics can be understood by noting that h is a bandwidth used along with a specialized nonparametric kernel function Kh (·, ·) aiming to estimate a U[0, 1] distribution on the square (and also take into account boundary effects) given a finite set of observations and N is the number entries in the observed time series. In the definition of Q, j is a user specified index that selects the spacing between the Zˆτ ’s (a value of j = 1 is commonly used 3 ). In the definition of M, ρˆml (j) denotes the sample cross correlation between Zτm and Zτl −|j| (stationarity is assumed) and w(·) is the Bartlett kernel; 3 note that in M(m, l), the user no longer selects j but one specifies the integer parameters m and l. The integer truncation parameter p is needed for computation, but often has little influence over the results; 3 in this study p was set to 20. The expression for the remaining constants A0 and V0 are give in Ref. 3 In this article, results using Q(1) and M(1, 1), are reported.

3 Rough Guidelines to More General SDE Modeling A descriptive outline of a suggested approach for fitting SDEs to MD simulation or experimental times series is presented below. This type of approach is fairly standard in statistical model building, but comments relevant to chemical physics and coarse graining are made for the reader’s benefit. • 1) Inspect Individual Trajectories Look for qualitative features in the sample trajectories. For example, is there a single common mean that the data oscillates around or is there collection of mean reverting centers? Does the noise magnitude appear to be constant for all positions? Use qualitative and quantitative observations made here to guide next step below. The various descriptions will likely depend on the time scale one is interested in modeling. • 2) Select Candidate Model Class Based on the types of observations made above and on the time scale of interest, select a candidate surrogate model class. For example, if one is interested in the curvature of a deep “well minimum” and the effective noise appears to be constant over the time scale observations are made over, then the classic Ornstein Uhlenbeck model would be a reasonable candidate parametric model class. The validity of this proposed model is quantitatively assessed in step 4) below. 2

• 3) Estimate a Plausible Range of Drift and Diffusion Functions Following the sound advice outlined in Ref., 4 obtain a reliable initial guess using a closed form estimator (e.g. if using a nonlinear drift, consider building off of OU initial guesses and setting higher order terms to zero). Refine the initial estimate using a more complicated estimator or set of estimating equations. If the refinement is suspect, retain the estimate, but utilize another consistent estimating scheme 4 and subject all estimator output to the step below. • 4) Utilize Observational Data and Perform In-Sample Diagnostics Once a model is selected and estimated, develop a suite of test statistics for assessing the implications of the model making use of the data used to calibrated the stochastic model. Transition density based tests are useful because one can make use of all of the moments (and the assumed temporal dependence structure) and test this against the more complex data in hand. The goal is to see if there is a enough evidence in the data to justify rejecting the surrogate. Various goodness-of-fit tests should be employed because different tests tend to focus more heavily on uniquely different alternatives (e.g. some tests focus more heavily on non-Markovian autocorrelations as demonstrated in this article). • 5) Refine Model Class If necessary, return to step 2). For rejected models, analysis of the residuals (or generalized residuals) of a simple hypothesized model can greatly assist one in proposing new features to the candidate model. The M-statistics of Hong and Li have proven useful in this context; 3 this type of analysis can help one in determining if the drift is nonlinear and/or if a constant diffusive noise assumption is questionable. Proceed to step 6) if a satisfactory model or class of models is found (model rejected by statistical hypothesis tests may still have merit). • 6) Assess Model Predictions Just because a model passes appropriate goodness of fit tests does not imply that the model will be able to be used for forecasting. Conversely, rejected models can have a redeeming quality and be useful for predictions. It is important to make the distinction and systematically determine what can and cannot be inferred from a simple surrogate model calibrated from a complex system. • 7) Summarize Inferred Information The classic approach to this is to estimate a single dynamical model and report confidence bands (which may be obtained either through asymptotic arguments of Monte Carlo simulation). However, if non-ergodic sampling is present, tools like mixed effects models may be useful in summarizing the information content in a collection of time series. In this author’s opinion, tools making better use of simulation and experimental data are needed to help in understanding biological systems at small length and time scales.

3

9 7

PMF [kBT]

5 3 1 −1 −3 −5 −7

−1.4

−1.2

−1

−0.8

−0.6

−0.4

−0.2

0

z [nm] Figure 1: Potential of Mean Force (PMF) computed in Ref. 5 The region near -1 nm is the binding pocket this studies focuses on.

4 Mixed Effects Models Portions of the setup of the mixed effects models presented in the text will be repeated in what follows with some additional descriptions to assist in clarifying the technical details of the model. The setup is as follows: ˜ ˜ i,j = µD˜ + bD D i + ǫi,j i = 1, 2, . . . , NIC j = 1, 2, . . . , NRep ,

(3)

where the left hand side in these mixed effects models is the local diffusion coefficient associated with a given θˆ(n) estimated from a finite discretely sampled time series, ǫi,j represents the residuals of the regression ( ǫi,j can be thought of as modeling the sampling ˜ noise due to a finite number of time series samples), µD denotes a fixed-effect population ˜ mean of the local diffusion coefficients and bD i denotes a random-effect specific to initial condition i. NIC represents the number of initial configurations (ICs) analyzed, where an IC is defined by the position of all atoms in a simulation; NRep denotes the number of repeat experiments for a given fixed IC (the position of all atoms is fixed, but different 4

initial velocities are used). To fit these mixed effects model above, the nlme package in the R language was used with the default settings; 6 the random effects and residuals are modeled as normal random variables; the variance of the random effects and residuals are estimated from observed data. More explicitly ǫi,j ∼ N (0, σǫ2) where this notation indicates that all residuals (independent of index) are fit to a model assuming that these quantities come from a normal distribution possessing a mean of zero and variance (estimated from ˜ 2 data) given by σǫ2 . Similarly for bD ˜ ). The significance of the random term, i ∼ N (0, σD and hence dependence on initial condition, is tested using both the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). Because ergodic sampling may not occur, the random-effect, i.e. variation due to the initial condition, may be statistically significant. For cases where non-ergodic sampling occurs, this type of approach shows promise for 1) detecting this situation and 2) providing a less coarse-grained description of the data.

References (1) Tang, C.; Chen, S. J. Econometrics 2009, 149, 65–81. (2) Calderon, C.; Arora, K. J. Chem. Theory & Comput. 2009, 5, 47. (3) Hong, Y.; Li, H. Rev. Fin. Studies 2005, 18, 37–84. (4) Le Cam, L. ISI Review 1990, 58, 153–171. (5) Calderon, C.; Janosi, L.; Kosztin, I. J. Chem. Phys. 2009, 130, 144908. (6) Pinheiro, J.; Bates, D.; DebRoy, S.; Sarkar, D.; the R Core team, nlme: Linear and Nonlinear Mixed Effects Models; 2009, R package version 3.1-96.

5

60 50

Q(1)

40 30 20 10 0 −10 −2

−1.5

−1

−0.5

0

0.5

1

1.5

2

−1.5

−1

−0.5

0

0.5

1

1.5

2

log10(∆ t [ps])

60 50

M(1,1)

40 30 20 10 0 −10 −2

log10(∆ t [ps])

Figure 2: Identical to Fig. 3 in main text except time series were generated using a constraining potential. The resulting fitted SDEs performed similarly in regards to goodnessof-fit. 6