Absolute Copy Number from the Statistics of the Quantification Cycle

Jan 11, 2015 - amount, or copy number (N0), of the target DNA. Cq may be ... standard deviation (SD) for preparing a reaction having copy number N0 is...
0 downloads 0 Views 1MB Size
Article pubs.acs.org/ac

Absolute Copy Number from the Statistics of the Quantification Cycle in Replicate Quantitative Polymerase Chain Reaction Experiments Joel Tellinghuisen* Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37235, United States

Andrej-Nikolai Spiess Department of Andrology, University Hospital Hamburg-Eppendorf, 20246 Hamburg, Germany S Supporting Information *

ABSTRACT: The quantification cycle (Cq) is widely used for calibration in realtime quantitative polymerase chain reaction (qPCR), to estimate the initial amount, or copy number (N0), of the target DNA. Cq may be defined several ways, including the cycle where the detected fluorescence achieves a prescribed threshold level. For all methods of defining Cq, the standard deviation from replicate experiments is typically much greater than the estimated standard errors from the least-squares fits used to obtain Cq. For moderate-to-large copy number (N0 > 102), pipet volume uncertainty and variability in the amplification efficiency (E) likely account for most of the excess variance in Cq. For small N0, the dispersion of Cq is determined by the Poisson statistics of N0, which means that N0 can be estimated directly from the variance of Cq. The estimation precision is determined by the statistical properties of χ2, giving a relative standard deviation of ∼(2/n)1/2, where n is the number of replicates, for example, a 20% standard deviation in N0 from 50 replicates.

S

the threshold definition of Cq (often labeled Ct), N = Nq when y = yq, giving

ince its advent over two decades ago, real-time quantitative polymerase chain reaction (qPCR) has been the standard method for determining very small amounts of genetic material,1 accomplished by amplifying the target gene (amplicon) by many orders of magnitude to make it detectable. This occurs through a cyclical process in which the target material and appropriate reagents are alternately heated and cooled, approximately doubling the number of amplicons N with each cycle. The process is monitored by fluorescence, and it takes 10−30 cycles for the target fluorescence to rise perceptibly above a baseline from the starting materials in the reaction. For quantification, the standard procedure has been to compare the amplification profile for the unknown with similar profiles obtained for the same amplicon at several known initial copy numbers (N0). To that end, it is necessary to establish a marker on each profile that can be related reliably to the N0 for that reaction. The quantification cycle (Cq) plays that role; it can be defined a number of ways (Figure 1) and is related most simply to the logarithm of N0, as can be obtained from the equation for amplification in the early cycles, N = N0E x

Cq =

log(E)

(2)

Ideally, a plot of Cq vs log(N0) for the knowns is linear, with slope −1/log(E). The Cq for the unknown then determines its N0. The same relation can be used for other definitions of Cq, like the first- and second-derivative maxima, and Cy0 (Figure 1), so these can be used equivalently in calibration, often more reliably than the threshold Cq.2,3 This is true as long as the curves have constant shape for all N0, even though eq 1 can be valid only for the baseline and early growth region at most, since E must decline to 1 as the reaction saturates in the large-x plateau region. When amplification profiles are recorded for many replicate reactions (reps) in a dilution series, the profiles for the most dilute reactions may display anomalous excess dispersion along the cycle axis, as can be seen in Figure 2 for a dilution series notable for the very large number of replicates at each concentration.4 This effect has been attributed to the

(1)

where E is the amplification efficiency (ideally 2, usually smaller, and assumed to be constant) and x is the cycle number. Assuming that the fluorescence intensity y is proportional to N, we have the same equation for y, with y0 replacing N0. Using © XXXX American Chemical Society

log(Nq) − log(N0)

Received: October 30, 2014 Accepted: January 11, 2015

A

DOI: 10.1021/acs.analchem.5b00077 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry

number N0 is (N0)1/2. With that interpretation, it is possible to estimate the copy number from N0 = [(ln(E))2 σCq 2]−1

(3)

which is determined almost entirely by the sampling variance in Cq, requiring just a reasonable estimate of E. Further, since variance estimation follows the statistics of χ2, the relative SD is that of χ2: ∼(2/n)1/2, where n is the number of replicates. For the most dilute reps included in Figure 2, this treatment yields N0 = 14 ± 2, in remarkable agreement with the stated preparation. The derivation and implementation of eq 3 and the exploration of other sources of excess variance in Cq, especially variability in E and amplification fluctuations, occupy the present study. Equation 3 represents a “calibration-free” method of estimating the copy number. Such methods have attracted much recent attention. “Digital” PCR (dPCR) originated about the same time as conventional qPCR,8 with many practical demonstrations within a few years.9 For good performance, these methods require an average N0 < 1, so that most reactions will be nulls (no product) and only a small fraction will have N0 > 1. In the same low-N0 regime, the qPCR amplification profiles spread out along the cycle axis, permitting identification of profiles for N0 = 1, 2, etc., in what has been called “limiting dilution assay” (LDA).10 Recently, drop and microchannel methods have made it easier to collect the large numbers of reps needed for good precision in dPCR.11−13 In contrast with all of these methods, the use of Cq variance for estimating N0 should be optimal for N0 in the range of 10−200, perhaps extendable by another factor of ∼3. Compared with the other methods, this means reduced dilution demands, and the data can be collected with standard qPCR equipment. The only other calibration-free method we know of for this N0 regime is LRE with optical calibration,10 which can give excellent results even for single-reaction data, but which becomes unreliable for profiles showing high asymmetry in the transition region (see below). Very recently, while the present work was in preparation, Mojtahedi et al. described a method that is, in principle, very similar to ours.14 These authors implement their method by directly fitting experimental Cq distributions to expressions based on the Poisson distribution for N0. In so doing, they tacitly attribute all dispersion in Cq to Poisson uncertainty. As we show below, there are other significant contributions, notably from variability in the pipetted volume and in E and from the least-squares imprecision in the estimates of Cq. Neglect of these contributions will lead to underestimation of N0, as was found in ref 14. Compared with fitting Cq distributions, our Cq variance method focuses directly on the property of Cq that relates to N0 and is simpler to apply when effects other than the initial Poisson distribution contribute to the Cq dispersion.

Figure 1. Synthetic qPCR curve, illustrating four common location indices used as Cq: threshold cycle Ct = 11.4 (for yq = 0.44, horizontal dashed line), second derivative maximum (SDM) = 14.0, first derivative maximum (FDM) = 16.0, and Cy0 = 13.0. The last is defined as the intersection of the baseline with a line tangent to the curve at the FDM.



Figure 2. qPCR profiles from the 94 × 4 Reps technical data set from ref 4. The relevant copy numbers N0 are stated to be 15 000 (left), 1500, 150, and 15 (right). The raw curves are shown at the top, baseline-corrected in the middle, and fully scaled at the bottom. Note the greater dispersion of the curves along the cycle axis for the most dilute reactions in the scaled data.

MATERIALS AND METHODS Least-Squares Computations. Many of the LS calculations described here were done using the KaleidaGraph program (KG, Synergy Software), as in our earlier works on qPCR analysis.3,15 This program has a user-friendly routine for nonlinear fitting, weighted and unweighted.16 It employs the Marquardt algorithm to facilitate convergence in nonlinear fitting,17 and importantly, its output includes parameter standard error (SE) estimates. However, it is designed for

limitations of Poisson statistics,5−7 according to which the standard deviation (SD) for preparing a reaction having copy B

DOI: 10.1021/acs.analchem.5b00077 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry one-at-a-time analysis of data sets, and in the present work, we need to fit many replicate reactions efficiently. For that purpose, we have coded our own programs in Microsoft FORTRAN and in the R language.15,18 These programs will be described in more detail elsewhere; however, they do not differ significantly from Marquardt-based routines that have long been available.19,20 We use the threshold definition of Cq in these calculations, fitting the early growth region to eq 1 plus a baseline function, y(x) = bas(x) + y0 E x ≡ bas(x) + yq E(x − Cq)

which performs much better than conventional 3-parameter versions on data with asymmetry in the transition region.23 (These are often referred to as 5- and 4-parameter LL functions, the extra parameter being a single constant for the baseline.) For the 94 × 4 Reps data, we find the best performance fitting to ∼20 cycles centered on the transition region, and here, a quadratic baseline, bas(x) = a + b(x − g ) + c(x − g )2

works better than the saturation function of eq 5. This function is centered at x = g, so a represents the magnitude of the baseline in roughly the middle of the transition region. To specify the region to be fitted, we use the finite-difference method described by Boggy and Woolf24 to obtain the approximate SDM. Here, the first derivative is approximated as di = yi+1 − yi, and the second derivative is similarly estimated from the di until a maximum is observed. We then fit from nb cycles below this to na above. For the 94 × 4 Reps data, the choices nb ≈ 8 and na ≈ 12 give near-optimal results. When the fit has converged, the LL4 parameters are used to calculate a precise SDM, which is then used to set the fit range (cycles 1SDM) for the fit to eq 4, in which yq is taken as ∼0.15 times the LL4 amplitude parameter d. (Similar results are obtained for factors in the range of 0.12−0.18.) The estimate of Cq obtained from the fit to eq 4 is uncertain by its SE, which is obtained from the covariance matrix computed in the course of the matrix operations used to solve the LS equations.19,20,25,26 The square of this SE constitutes an unavoidable “estimation variance” that must be subtracted from the total ensemble variance of Cq in the analysis to obtain the copy number, discussed below. Error Propagation and Copy Number. We use the simple form of error propagation for independent sources of uncertainty,15,27

(4)

in which y0 represents the target fluorescence signal in cycle 0 and Cq is the cycle for which the target signal reaches the specified level yq. bas(x) represents the baseline function; in the analysis of the two technical data sets in ref 4 by two of the compared methods, FPLM and DART,21,22 this was taken to be a 3 parameter saturation function: bas(x) = a − q exp( −rx)

(5)

This form does appear to well approximate the early cycles in those data sets, so we use it here for their analysis. However, bas(x) can be redefined for data where the baseline behavior is different. Several comments about eqs 4 and 5 are appropriate. First, the two forms in eq 4 are statistically equivalent, which is why we have noted [footnote 11 of ref 15] that methods based on eq 1 need not actually obtain Cq to estimate y0: If the second form is used, then the equation, y0 = yq /ECq

(6)

yields the same y0 as the first form, for any choice of yq. Here, we want Cq and use the second form, for which the SE of Cq does depend on yq, being minimal when Cq falls close to the maximum fitted cycle in the early growth region.15 A second point is that, through eqs 4 and 5, we estimate the baseline function and the growth parameters in a single nonlinear fit for each reaction. In almost all methods considered in ref 4, the baseline is estimated separately from the growth parameters, from a fit of just early cycles to the chosen baseline function, which is then frozen (or subtracted) in the analysis of the amplification function. However, this procedure is both unnecessarily complex and statistically inefficient. Monte Carlo simulations, to be discussed elsewhere, confirm that the statistically optimum procedure is to fit all cycles deemed to be represented by the fit model at once, to the combined function, as in eq 4. Of course, there is no guarantee that the baseline function suggested by the early cycles must hold for all cycles; but that is the assumption of the two-step procedures anyway, so it is best accommodated by the one-step combined fit. The use of a constant absolute threshold yq to define Cq is suboptimal when the data vary in scale from rep to rep, as we have noted3,15 in connection with the two technical data sets from ref 4, which do show sizable intensity variation. To compensate for such variation, we take yq to be a specified fraction of the (plateau−baseline) amplitude. To obtain the latter for the 94 × 4 Reps data, it is useful to fit to a function with the amplitude as an adjustable parameter, because many of the reactions never attain a clear plateau. We use the 4parameter log−logistic function, LL4(x) = d[1 + (g /x)h ]−p

(8)

σf 2 =

⎛ ∂f ⎞2 ∑ ⎜⎜ ⎟⎟ σ βi 2 ⎝ ∂βi ⎠

(9)

where β represents the uncertain independent variables. Equation 6 can be rearranged to give Cq as a function of E and y0 (yq being a user-defined constant), and application of eq 9 gives 2

(ln(E)) σCq

2

⎛ σy ⎞2 ⎛ σ ⎞2 = ⎜⎜ 0 ⎟⎟ + Cq 2⎜ E ⎟ ⎝E⎠ ⎝ y0 ⎠

(10)

With some provisos (see below), this expression predicts the effects of variability in y0 and E on the ensemble precision of Cq. [It does not correctly relate estimation uncertainties in y0 and E to those in Cq, because it neglects correlation, and y0 and E are strongly correlated when estimated from single-reaction data by LS fitting.15] Since y ∝ N, σy0/y0 = σN0/N0. For small N0, where Poisson statistics5−7,19 limit the precision with which the sample wells can be charged with the target amplicon, this ratio is 1/(N0)1/2. Neglecting the second term on the right-hand side of eq 10 yields eq 3, from which the copy number is obtained essentially from just the variance in Cq (E being reasonably close to 2), and since sampling estimates of variance follow the statistics of χ2, we can also estimate the precision of N0. The relative SD is (2/ν)1/2, ν being n − 1 in the simplest cases.28 Thus, for example, the 94 replicates for each N0 in Figure 2 suffice to estimate each σCq2 with 15% SD.

(7) C

DOI: 10.1021/acs.analchem.5b00077 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry A second, independent contributor to σy0/y0 is pipetting uncertainty. Pipette manufacturers quote uncertainties around 5% for 1 μL delivery volumes; since the uncertainty is roughly constant in absolute volume, this contribution to eq 10 goes inversely as the square of the delivery volume. At 5%, it becomes equivalent to the Poisson contribution for N0 = 400. For comparison, when σE/E = 0.01 and Cq = 25, the last term in eq 10 is equivalent to the Poisson contribution for N0 = 16. Excess Variance in Cq from Variability in E. A closer look at variability in E reveals limitations in eq 10. Consider first random cycle-to-cycle variability in the efficiency. This might result, for example, from variability in the temperature/ time program. We rewrite eq 1 as y = y0 E1E2E3...Ex

on the x-axis and thence to the ensemble SD of Cq but not at all to the cycle-to-cycle variability sensed by the single-reaction LS fits. Equation 16, with σ replacing δ, is equivalent to the second term in eq 10, which thus correctly accounts for ensemble repto-rep variability in y0 and E but not cycle-to-cycle variability in E, for which Cq2 in eq 10 must be replaced by Cq. Amplification Fluctuations. The amplification efficiency E is ideally 2 but is frequently observed to be 1.9−1.95. This means that sometimes an amplicon fails to replicate in a given cycle, and this can contribute significantly to the variance in N in early cycles and hence to that in Cq. Amplification is an either-or event, so it can be treated by the binomial distribution. Suppose E = 1.9. Then, 90% of the time an amplicon replicates, 10% it does not. Taking s as the replication probability (0.9), the variance in the first cycle amplification is N0 s (1 − s),19 leading to a squared relative error in N1 = N0 E of (E − 1)(2 − E)/(N0E2). Similar considerations hold for cycles 2, 3, ..., q, and the contributions are independent, leading to a cumulative squared relative error in Nq that is the sum of the contributions for all earlier cycles. In practice, only the first 5−7 cycles (where N remains relatively small) contribute significantly to the relative error in Nq; extending the sum to infinity yields a simple result for this amplification fluctuation (AF) variance:

(11)

to emphasize that the Ei values are independent with respect to E variability. In keeping with the simple growth equation, we suppose all have the same nominal value E and uncertainty σE. Then, the rhs of eq 9 has x identical terms, giving σy 2 = y 2 x(σE /E)2

(12) 1/2

from which (σy/y) = x (σE/E). This represents a variability in the target signal y at cycle x. We want the resulting variability in x = Cq when y = yq. This is given by the projection of σy onto the x axis at a specific x value x(y), −1

σx(y) = σy(dy/dx)

σAF 2 Nq

(13)

For x = Cq, the effective uncertainty is thus σCq =

(14)

For example, if σE = 0.01, E = 2, and Cq = 25, this source yields σCq = 0.036, but the cycle-to-cycle variability is 5 times smaller so would appear to be much less significant from the standpoint of the LS curve fitting. In effect, this random error becomes “baked into” the curve during the baseline cycles, leading to a variability in the ensemble statistics of Cq that greatly exceeds the LS predicted errors. This is an unusual case where random error manifests as apparent systematic error for individual samples but is randomized over replicates. Next, suppose the Ei values in eq 11 are subject to systematic error, perhaps from variations in the chemical charging of the different wells belonging to a replicate set. Systematic error is of fixed sign and magnitude, presumably discoverable by suitable measurements. As before, let all Ei values be the same, now with systematic error δE. In place of eq 9, we have19 δf =

⎛ ∂f ⎞ ⎟⎟δβi ⎝ ∂βi ⎠

∑ ⎜⎜

xδE E ln E

2−E N0E

(17)



RESULTS AND DISCUSSION Figure 3 shows analysis results for the first 94 × 4 Reps reaction obtained using the fit model of eqs 4 and 5. The complete results (to right) illustrate the equivalence of the two forms in eq 4, while the abbreviated results show how Cq and its SE vary with yq, the SE becoming minimal when Cq is near the maximum cycle included in the fit. If the fit range is decreased by one cycle, to 21, “Chisq” remains about the same, as do the other parameters, but the SEs for the growth parameters roughly double. Increasing the fit range to 23 cycles doubles Chisq, while adding another cycle yields another 6-fold increase. These results support limiting the fit range for this model to the cycle nearest the SDM for these data, as recommended by the authors of the FPLM method.4,21

(15)

which comes from the exact differential df for f(β). In place of eq 12, we have δy/y = x (δE/E), giving for the effective displacement at x(y),

δx(y) =

=

This contribution thus has the same dependence on N0 as the Poisson contribution, vanishing as E → 2; for E = 1.9, it amounts to 5% of the Poisson preparation variance. Data. In the results discussed below, we emphasize the 94 × 4 Reps data set provided in the Supplement to ref 4. This data set contains an unusually large number of replicates for each of four 10-fold dilutions, starting with 15 000 copies of the MYCN gene. The raw intensity data are provided for all reactions (Figure 2), and all but one of these (# 150_28) yielded usable results on analysis with our programs. [The data in the Methods Supplement are labeled inversely from their concentrations; for example, the first reaction, labeled 15_1, actually belongs to the highest concentration, having N0 = 15 000.] We have also analyzed several other replicate data sets available to us (and downloadable at www.dr-spiess.de) for which absolute copy numbers are approximately known: 20 replicates at each of 6 dilutions from Rutledge and Cote (minimum N0 = 417),29 12 × 7 from Guescini et al. (N0 = 31),30 and 18 × 5 from Lievens et al. (N0 = 160).31

Cq σE E ln E

2

(16)

If this systematic error is randomized over the replicates, the δs are replaced with σs, and the result differs from eq 14 just by replacement of (Cq)1/2 by Cq. Thus, for Cq = 25 and the same σE, this effect would be 5 times larger than the random cycle-tocycle error. It contributes to the dispersal of the replicate curves D

DOI: 10.1021/acs.analchem.5b00077 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry

Having variance data for the four concentrations in the 94 × 4 Reps set, we can conduct a more complete analysis than that given just after eq 3. We first subtract the estimation variance from the total to get the excess variance, which we then fit to variants of eq 10, in which the term in (σE/E)2 accounts for repto-rep or cycle-to-cycle (eq 14) variability in E. To cast eq 10 into the y = f(x) form, we take σy0/y0 to be determined entirely by Poisson uncertainty and let A = N0 for the most dilute reps and B = σE. Then, the 4 N0 values are A × 10m for m = 0−3. Defining y′ as σexcess2 × 10m, we have x′ = Cq × 10m for cycle-tocycle variability in E, Cq2 × 10m for rep-to-rep. The fit model then becomes ⎞⎡ 2 ⎛ ⎛ B ⎞2 ⎤ 1 ⎜ ⎟ ⎥ ⎢ x + ′ y′ = ⎜ ⎟ ⎝E⎠ ⎦ ⎝ (ln(E))2 ⎠⎣ A × E

Figure 3. LS results for analysis of 94 × 4 Reps reaction MYCN_STDA15_1 using eqs 4 and 5 as fit model. The complete results to the right illustrate the equivalence of the two forms of eq 4; the abbreviated results to the left are equivalent to those at the top right, showing how Cq (= Ct here) and its SE (“Error”) vary when yq is changed from the value of 96.4 used for the FPLM method on these data in ref 4; all other results remain the same as yq is changed. Cycles 1−22 are included in the fit, the last being the approximate SDM for this reaction. In these unweighted fits, the quantity "Chisq" is the sum of squared residuals.

(18)

in which the first term in square brackets now includes the contributions from amplification fluctuations (AF). Results (Figure 5) favor cycle-to-cycle E variability and indicate 1.0%

Figure 4 compares results from the present treatment with results from ref 4, showing that the use of a relative threshold

Figure 5. Analysis of excess variance in Cq for the 94 × 4 Reps data set, using eq 18 for the two alternative definitions of x corresponding to the two considered sources of variability in E. E is taken as 1.987, from a linear fit of Cq vs log(N0). The error bars (σ) are (2/93)1/2 times the estimated total variances, and Chisq (χ2) is the sum of weighted squared residuals, with weights taken as σi−2.

uncertainty in E. Since some of the excess variance is now attributed to E variability and AF, the estimates of N0 are larger than from the simpler treatment by eq 3, but they are still statistically consistent with the nominal value. In the analysis of Figure 5, we have ignored pipetting uncertainty. Using the stated 3 μL volumes4 and the pipet manufacturer’s < 6% quote at 1.0 μL, we can correct for this in the fit assuming 2%. The effect is no change in N0 and a small reduction in σE (to 0.0187 in the first fit, for example). However, if this error is as large as 5%, it renders σE insignificant in both models, while also reducing the fit χ2 value to 7.0 and N0 to 15.9. This result means that we can obtain a better fit to these data assuming pipetting uncertainty and no E variability. Lacking better information about the pipetting uncertainty, we must designate the conclusions about E variability as tentative. The data sets from refs 29−31 have far fewer replicates, so it is of interest to see how they perform when submitted to this analysis. Figure 6 shows the most extreme test of the method, on the 20 × 6 replicate data of Rutledge and Cote, for which the minimum N0 is stated to be 417. Analysis of the excess variance by eq 18 yields a good fit (χ2 being close to the value of 4 expected for the 4 degrees of freedom), but N0 is

Figure 4. Standard deviations in Cq from the present reanalysis, as a function of initial amplicon concentration, compared with the best and worst results from ref 4, from statistics obtained in ref 3. All results are ensemble SDs from the 94 reps for each concentration, except the last, which are the root-mean-square values from the present LS standard errors. For the relative threshold (solid squares), the quantity d is the amplitude parameter from eq 7.

(yq = 0.15 d) yields precision comparable to the best from ref 4.3 Compared with the previous FPLM results, the better performance using the absolute threshold yq = 700 is partly from a better choice of yq (Figure 3) but more from the onestep fit in place of the two-step procedure of the classic FPLM method. (See the Supporting Information for more detailed comparisons.) For all concentrations, the ensemble SDs greatly exceed the rms SEs, by minimally a factor >4. This means that real variability in Cq dominates in σCq2. Also, since the Poisson contribution is essentially negligible at the highest two concentrations, it appears that variability in E and/or pipet volume uncertainty contribute significantly to the total Cq variance. E

DOI: 10.1021/acs.analchem.5b00077 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry

variability; however, this result is based on the assumption that pipetting volume uncertainty is negligible, so it must be considered tentative. The four data sets we have analyzed all involve dilution series with 4 or more concentrations. This has permitted simultaneous LS estimation of N0 and the ensemble σE. Also, the multiple concentrations permit estimation of E, which is needed for the other quantities, but to which the estimate of N0 is not highly sensitive. It is noteworthy that the N0 estimates have nominal uncertainty