Anal. Chem. 1996, 68, 1941-1947
Interlaboratory Comparison of Autoradiographic DNA Profiling Measurements. 3. Repeatability and Reproducibility of Restriction Fragment Length Polymorphism Band Sizing, Particularly Bands of Molecular Size >10K Base Pairs

Adam M. Stolorow, David L. Duewer, and Dennis J. Reeder*
Chemical Science and Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland 20899

Eric Buel
State of Vermont Forensic Laboratory, Waterbury, Vermont 05676

George Herrin, Jr.
Division of Forensic Sciences, Georgia Bureau of Investigation, Decatur, Georgia 30034
The observed interlaboratory standard deviation (SD) associated with the restriction fragment length polymorphism (RFLP) measurement of DNA fragment size is a predictable function of the observed mean band size (MBS). For DNA fragments of size 1000 base pairs (bp) to the largest resolved component of commonly used “sizing ladder” calibration materials (about 20 000 bp), the variation in the sizing data from the Technical Working Group on DNA Analysis Methods (TWGDAM)-sponsored “Large Fragment Study” is well-described by $\mathrm{SD} = 7.5\,(1 + \mathrm{MBS}/19\,500)^{7.1}$. This sizing variability arises from a 0.1-0.4% SD in the relative positions of sample and calibration bands among electrophoretic gels. Statistically significant sizing differences do exist for bands above 10 000 bp among laboratories that use different calibration materials. The Large Fragment Study was efficiently accomplished through the use of a designed set of DNA samples, requiring but one gel in each of 20 participating laboratories.

We report here an interlaboratory comparison of autoradiographic DNA profiling measurements. In part 1 of this series, we demonstrated that the restriction fragment length polymorphism (RFLP) protocol in common forensic use throughout North America produces reliable, reproducible results that are comparable among laboratories.1 When given identical samples, laboratories that follow this protocol and appropriately monitor their results (with control and reference materials, internal quality assurance programs, and external proficiency demonstrations) produce consensus measurement values of the molecular size of DNA fragments (“bands”) within a small, predictable standard deviation (SD). In part 2, we showed that (1) the intrinsic one-SD band sizing uncertainty resulting from image analysis can be attributed to a 0.1% SD in the measurement of the relative position of sample and calibration bands and (2) the observed interlaboratory band sizing SD and its dependence on the estimated mean band size (MBS) can be described as a 0.2-0.3% SD in the relative position of sample and calibration bands.2 In part 4, we will document the magnitude and sources of the small but systematic biases in estimating band sizes observed among laboratories.3

Due to the paucity of very large and very small DNA fragment size estimates in the available data, the above conclusions were limited to bands of size 1000-10 000 base pairs (bp). The relative scarcity of DNA fragment band sizes larger than 10 000 (10K) bp in forensic casework creates a dilemma: while the presence of an infrequently observed band in a RFLP profile inherently should increase discriminatory power, the absence of accepted “matching criteria” for bands ≥10K bp has led many forensic laboratories to not use or report such bands, thus decreasing the discriminatory power of the profile as a whole.

This problem was addressed in mid-1993 through studies designed to extend the quantitative description of RFLP measurement uncertainties as a function of band size to DNA fragments of size ≥10K bp. Technical Working Group on DNA Analysis Methods (TWGDAM) laboratories were surveyed for casework and/or proficiency data they had accumulated on repeatedly sized “large” DNA fragments. While all the resulting repeatability (short-term, within-laboratory) statistics were qualitatively consistent with previously established relationships, no clear quantitative trend could be identified for the ≥10K bp bands, nor could these single-laboratory analyses of different samples fully characterize the desired reproducibility (long-term, among-laboratory) variation.

(1) Part 1 of this series: Mudd, J. L.; Baechtel, F. S.; Duewer, D. L.; Currie, L. A.; Reeder, D. J.; Leigh, S. D.; Liu, H.-K. Anal. Chem. 1994, 66, 3303-3317.
(2) Part 2 of this series: Duewer, D. L.; Currie, L. A.; Reeder, D. J.; Leigh, S. D.; Liu, H.-K.; Mudd, J. L. Anal. Chem. 1995, 67, 1220-1231.
(3) Part 4 of this series: Duewer, D. L.; Currie, L. A.; Reeder, D. J.; Leigh, S. D.; Filliben, J. J.; Liu, H.-K.; Mudd, J. L. Interlaboratory Comparison of Autoradiographic DNA Profiling Measurements. 4. Protocol Effects, in preparation.

© 1996 American Chemical Society
Analytical Chemistry, Vol. 68, No. 11, June 1, 1996 1941
We concluded that interlaboratory data from common samples were needed to provide the required quantitative information. Examination of the component spacing of the two most commonly used commercial RFLP calibration materials (“sizing ladders”) suggested that ∼20 well-distributed bands would serve to characterize sizing performance from 4000 to 24 000 bp. This range spans from previously available data to slightly larger than the largest band of either ladder.

An experiment designed to maximize the number of participants by minimizing the resource demands on individual participants was developed in early 1994 in a cooperative effort between the National Institute of Standards and Technology (NIST) and the State of Vermont Forensic Laboratory. Six samples identified in the survey were provided to interested TWGDAM laboratories for analysis. Participants sent their results, along with copies of their autoradiograms, to NIST. All autoradiograms were independently reanalyzed at NIST.

This report presents the results of the TWGDAM Large Fragment Study and extends and quantifies the relationship between interlaboratory band sizing SD and MBS. These data permit the direct description of expected sizing SD attributable to interlaboratory reproducibility and to computer imaging. Additionally, this report demonstrates that there are differences in DNA fragment sizing attributable to the choice of sizing ladder. Comparison of the Large Fragment Study data with data from the initial survey and from an exceptionally large and highly replicated quality assurance data set from the Georgia Bureau of Investigation (GBI) demonstrates that most of the interlaboratory reproducibility SD arises from intralaboratory gel-to-gel differences.

METHODS AND MATERIALS

We have presented interim reports to the TWGDAM community at various times during the course of our studies on the functional relationships between band size and sizing uncertainty. Following a June 1993 discussion of the need for more data to adequately establish the relationship between interlaboratory reproducibility standard deviation and mean band size beyond 10K bp, representatives of several laboratories volunteered to help gather the necessary data. Eric Buel of the State of Vermont Forensic Laboratory undertook the coordination of these efforts.

TWGDAM Large Fragment Survey Data. The TWGDAM community was requested to provide replicate sizing data of large fragments from their records. Table 1 lists the laboratories that provided information. Sufficiently replicated (eight or more times, our cutoff in part 1 of this series) data were provided for a total of 80 bands; 14 of these bands were of size ≥10K bp. The number of replications ranged from 9 to 46; the nature of the replications ranged from multiple sizings of a single autoradiogram to single sizings of multiple gels. None of the samples were analyzed in more than one laboratory.

TWGDAM Large DNA Fragment Study. Six individuals whose DNA provides large DNA fragments were selected using information collected during the TWGDAM survey. EDTA-preserved blood from these individuals was obtained in standard blood tubes. Samples were provided as dried 200-µL aliquots on bloodstain cards (Life Technologies, Gaithersburg, MD) to all interested laboratories; the 20 Canadian, State, and local laboratories that returned data are also listed in Table 1. Participants were asked to extract, quantify, and run each sample in three separate lanes on one analytical gel using their individually validated protocols. Each sample lane combined two designated
Table 1. Participants in Large Fragment Survey and Study

Columns: agency, survey, study. The per-agency X marks in the survey and study columns are not recoverable here.

Arizona Department of Public Safety, Crime Laboratory
Broward County, Florida, Crime Laboratory
California Department of Justice, DNA Berkeley Laboratory
Florida Department of Law Enforcement, Jacksonville
Florida Department of Law Enforcement, Tallahassee
Georgia Bureau of Investigation
Illinois State Police, Bureau of Forensic Sciences
Kentucky State Police
Metropolitan Dade County, Florida, Police Department
Minnesota Bureau of Criminal Apprehension
Missouri State Highway Patrol
North Carolina State Bureau of Investigation
Ontario Centre of Forensic Sciences
Orange County, California, Sheriff-Coroner Department
Oregon Department of State Police, Forensic Laboratory
Pennsylvania State Police, Greensburg Regional Laboratory
Royal Canadian Mounted Police, Central Forensic Laboratory
South Carolina Law Enforcement Division
U.S. Army Central Identification Laboratory-Continental U.S.
Vermont State Forensic Laboratory
Virginia Division of Forensic Science, Central Laboratory
Washington State Patrol Crime Laboratory
DNA extracts from the six blood donors. The three extract pairs were termed “A” (donors KLM and RJH), “B” (RLD and LMM), and “C” (JC and EB). These sample mixes were chosen to provide four bands per lane, without any band overlap. This design provided (6 × 2 sample alleles) × 2 probes = 24 bands run in triplicate. Additionally, all participants provided (1 × 2 K562 alleles) × 2 probes = 4 control bands run once. (K562 is an immortalized female human cell line. The original K562 cell line is maintained at the American Type Culture Collection, Rockville, MD.) The K562 bands were termed “K”. Details on these 28 bands are listed in Table 2.

Each of the three mixed extracts was run in three separate preassigned lanes, using a casework-style gel (see Figure 1). The membrane resulting from a Southern blot4 of the analytical gel was probed at genetic loci D1S7 (“D1”) and D4S139 (“D4”) using ³²P-radiolabeled single-locus probes. The autoradiograms were evaluated, and DNA fragment sizes were estimated by the individual laboratories using their in-house imaging systems. Participants were asked to perform at least two independent image analyses. Some labs reported two sizings by the same analyst, while others provided one sizing by each of two analysts. One lab reported two sizings by each of four analysts. A copy of each autoradiogram was forwarded to NIST for evaluation and sizing by one of the authors (A.M.S.) on a BioImage workstation using Whole Band Analysis, Version 2.4 software (BioImage Inc., Ann Arbor, MI).

All participating labs generally follow a very similar HaeIII restriction endonuclease DNA digestion protocol developed by the FBI, with minor individualized modifications.5 Participants provided us with a list of specific testing factors. Examination of the effects of all such testing factors, with the exception of choice of sizing ladder, is beyond the scope of this paper.

Georgia Bureau of Investigation Quality Assurance Tests. The Georgia Bureau of Investigation (GBI) includes DNA samples yielding fragments of relatively large size in their routine quality assurance program. By the time the data collection stage of the TWGDAM Large Fragment Study was nearing completion, GBI was able to supply data on 141 bands from 15 different samples probed at five genetic loci. The band sizes range from 800 to 15 000 bp, with 22-70 replicate values per band. These data were collected from 228 different proficiency and casework gels over a period of about 2 years.

Disclaimer: Certain commercial equipment, instruments, and materials are identified in this report to specify adequately the experimental procedure. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose.

(4) Southern, E. M. J. Mol. Biol. 1975, 98, 503-517.
(5) Budowle, B.; Baechtel, F. S. Appl. Theor. Electrophor. 1990, 1, 181-187.

Table 2. Repetitions, Size, and Standard Deviation of Sample Bands

band(a)  ngels  nimage     MBS  SDAR  SDlab    SD
D4:C:1      19     180  19 814   555    950  1100
D1:C:1      18     165  17 933   454    668   808
D4:A:1      19     183  17 699   505    657   828
D4:B:1      18     174  16 969   369    614   717
D1:B:1      20     196  14 760   305    251   395
D1:A:1      19     189  13 696   157    149   216
D1:C:2      20     189  11 873    97    171   197
D4:C:2      20     198  10 040    68    113   132
D1:B:2      20     198   9 675    65    100   119
D1:A:2      20     198   9 612    76     95   122
D1:A:3      20     198   8 148    45     63    78
D4:C:3      20     198   8 025    40     73    83
D4:A:2      20     198   7 768    38     65    75
D4:C:4      20     198   7 215    30     57    64
D4:K:1      20      66   6 512    27     49    56
D4:B:2      20     198   6 293    31     63    70
D4:A:3      20     198   6 234    25     51    57
D1:B:3      20     198   5 870    41     56    70
D4:A:4      20     198   5 395    18     39    43
D4:B:3      20     198   4 593    19     34    39
D1:K:1      20      66   4 582    12     30    32
D1:C:3      20     195   4 245    15     33    36
D1:K:2      20      66   4 231    13     30    33
D4:B:4      20     198   4 160    14     30    33
D4:K:2      20      66   3 448     9     15    18
D1:B:4      20     198   3 279    12     24    27
D1:C:4      20     195   2 893    10     18    20
D1:A:4      20     198   2 193     8     13    15

(a) Code for specific DNA fragment: locus (D1S7 or D4S139), sample (A, B, C, or K), and position in lane (1st, 2nd, 3rd, or 4th).

RESULTS AND DISCUSSION

TWGDAM Large Fragment Data. Figure 1 illustrates the typical gel format used in this study. Table 2 provides the following summary information for all 28 bands: mean band size (MBS), number of electrophoretic gels on which each sample band was found (ngels), total number of times each band was sized (nimage), intraautoradiogram repeatability standard deviation (SDAR), interlaboratory reproducibility standard deviation (SDlab), and combined total standard deviation (SD). Bands are labeled according to probe, sample, and position in lane. For example, band D1:C:3 is the third largest band of sample C probed at locus D1S7. MBS gives equal weight to each gel regardless of the number of reported sizings:

$$\mathrm{MBS} = \frac{1}{n_{\mathrm{gels}}} \sum_{i=1}^{n_{\mathrm{gels}}} \left( \frac{1}{n_i} \sum_{j=1}^{n_i} bp_{ij} \right)$$

where $bp_{ij}$ represents a particular sizing of a given band on autoradiogram $i$, and $n_i$ is the number of $bp_{ij}$ recorded. SDAR is a measure of average within-autoradiogram dispersion:

$$\mathrm{SD}_{\mathrm{AR}} = \sqrt{ \frac{ \sum_{i=1}^{n_{\mathrm{gels}}} (n_i - 1)\, \mathrm{SD}_i^{2} }{ n_{\mathrm{image}} - n_{\mathrm{gels}} } }$$

where $\mathrm{SD}_i$ is the simple standard deviation for the $n_i$ sizings from autoradiogram $i$.6 SDlab is a measure of among-laboratory measurement dispersion:

$$\mathrm{SD}_{\mathrm{lab}} = \sqrt{ \frac{ \sum_{i=1}^{n_{\mathrm{gels}}} \left( \mathrm{MBS} - \frac{1}{n_i} \sum_{j=1}^{n_i} bp_{ij} \right)^{2} }{ n_{\mathrm{gels}} - 1 } }$$

SD combines the within-autoradiogram and among-laboratory dispersion components:
$$\mathrm{SD} = \sqrt{ \mathrm{SD}_{\mathrm{AR}}^{2} + \mathrm{SD}_{\mathrm{lab}}^{2} }$$

Intraautoradiogram Repeatability and Interlaboratory Reproducibility. A repeatability standard deviation characterizes independent results obtained using “the same method on independent test items in the same laboratory by the same operator using the same equipment within short periods of time”.7 For this study, we apply this formal definition to the triplicate analysis of samples within each gel, combined with the variable number of replicate sizings of each resulting autoradiogram. SDAR thus combines the variability expected for a single sample among lanes within a given gel and the variability of replicate imaging of a given autoradiogram. This is in accord with the definition used in the TWGDAM Phase 1a and 1b image analysis studies.1

A reproducibility standard deviation characterizes independent results obtained by “different laboratories, different operators, and different equipment”.7 Here we apply this definition to the average band sizes for each sample in the 20 sets of autoradiograms.

Figure 2 reveals the dependence of SDlab and SDAR on MBS. The data discussed in ref 1 and the expected relationships discussed in ref 2 are shown for comparative purposes.1,2 SDAR has the same magnitude and behavior as observed in the TWGDAM 1b Precision Study. The intraautoradiogram sizing variability is explained by a 0.05-0.1% (less than one resolution unit of a 1024-pixel-long digitized image) variability in measuring the relative position of sample and sizing ladder bands. SDlab also has the same magnitude and behavior as seen previously with the TWGDAM interlaboratory studies data and is explained by a 0.1-0.4% variability in the relative positions of sample and calibration bands across different electrophoretic gels. The relationship derived in ref 2 is thus extended to bands from 10 000 to 20 000 bp (see eq 1, below). However, this relationship holds only so long as one is interpolating between resolved sizing ladder bands. The uncertainty associated with extrapolating beyond the largest resolved ladder band has not been characterized but is clearly large.

Figure 1. Typical gel format for (a) D1S7 and (b) D4S139 loci probes. Calibration ladder bands are in lanes 1, 5, 9, and 14. The two K562 cell line control DNA bands are in lane 2. The four sample A (a 1:1 mixture of DNA extracted from sources KLM and RJH) bands are in lanes 3, 10, and 12; sample B (RFL and LMM) in lanes 4, 8, and 11; and sample C (JC and EB) in lanes 6, 7, and 13. Only the upper portion of each gel is shown.

(6) Korn, G. A.; Korn, T. M. Mathematical Handbook for Scientists and Engineers; McGraw-Hill: New York, 1968; Section 19.6-6.
(7) International Organization for Standardization. Statistics - Vocabulary and Symbols; ISO 3534-1; ISO: Geneva, Switzerland, 1993; Definitions 3.14-3.25.
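The four summary statistics defined for Table 2 (MBS, SDAR, SDlab, and SD) can be sketched in a few lines of Python. This is only an illustration of the definitions as written, not the software used in the study; the function name `band_statistics`, the dictionary keys, and the example sizings are hypothetical.

```python
import math
from statistics import mean, stdev


def band_statistics(sizings_by_gel):
    """Summarize one band from per-autoradiogram sizings.

    sizings_by_gel[i] holds the bp_ij values reported for autoradiogram i.
    Requires at least two gels and at least one gel with repeated sizings.
    """
    n_gels = len(sizings_by_gel)
    n_image = sum(len(gel) for gel in sizings_by_gel)
    if n_gels < 2 or n_image <= n_gels:
        raise ValueError("need >= 2 gels and at least one replicated sizing")

    # MBS: mean of per-gel means, so every gel carries equal weight.
    gel_means = [mean(gel) for gel in sizings_by_gel]
    mbs = mean(gel_means)

    # SD_AR: pooled within-autoradiogram SD; single-sizing gels add nothing.
    pooled = sum((len(gel) - 1) * stdev(gel) ** 2
                 for gel in sizings_by_gel if len(gel) > 1)
    sd_ar = math.sqrt(pooled / (n_image - n_gels))

    # SD_lab: dispersion of the per-gel means about MBS.
    sd_lab = math.sqrt(sum((mbs - m) ** 2 for m in gel_means) / (n_gels - 1))

    # SD: quadrature sum of the two components.
    sd = math.hypot(sd_ar, sd_lab)
    return {"n_gels": n_gels, "n_image": n_image, "MBS": mbs,
            "SD_AR": sd_ar, "SD_lab": sd_lab, "SD": sd}


# Hypothetical sizings (bp) for one band on three gels; not study data.
example = [[10000, 10010, 9990], [10100, 10110, 10090], [9900, 9910, 9890]]
stats = band_statistics(example)
print(stats)
```

As a consistency check on the tabulated values, combining SDAR = 555 and SDlab = 950 for band D4:C:1 in quadrature gives about 1100, the SD listed for that band in Table 2.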
There were no gross discrepancies between the band sizes reported by participants and those determined by the NIST analyst. A more detailed analysis of analyst-specific effects and the influence of specific protocol modifications will be presented elsewhere.

Intralaboratory Casework Repeatability. The SDs calculated for all bands collected during the Large Fragment Survey are also displayed in Figure 2. Note that the data occupy the entire region between the above-described relationships for intraautoradiograph imaging repeatability and intergel reproducibility. Reexamination of the original data reveals that, indeed, the SDs near the imaging repeatability boundary represent multiple sizings of one or a few autoradiographs and the SDs near the intergel reproducibility boundary represent single sizings of autoradiographs from different gels. Most of the SDs between
Figure 2. Band sizing standard deviation (SD) as a function of mean band size (MBS). TWGDAM Phase 1b interlaboratory autoradiogram imaging reproducibilities (summarizing the multiple sizings of the same autoradiograms in different laboratories) are denoted “I”; all intra- and interlaboratory sizing reproducibilities described in ref 1 are “O”; intraautoradiogram repeatabilities (SDAR) from the experimental phase of the Large Fragment Study are “R”; interlaboratory reproducibilities (SDlab) are “9”; long-term repeatabilities from Georgia Bureau of Investigation data are “[”; and the mixed-type SDs for the Large Fragment Survey data are “1”. The curves were calculated from eq 1 using the following σRmin: 0.05% (long-dashed curve), 0.1% (short-dashed curve), 0.2% (dotted curve), and 0.4% (solid curve).

Table 3. Repetitions, Size, and Standard Deviation of GBI Bands

band   ngels     MBS   SD
BA124     29  10 216  187
S5        45  11 121  160
BA126     41  11 709  220
BA095     37  12 948  375
BA094     36  14 087  275
BA095     35  14 520  381
BA123     26  14 813  399
the two bounds represent a mixture of replicate sizings and different autoradiographs. These data confirm that the relationships apparent in the interlaboratory Large Fragment Study data do reflect the real world.

Intralaboratory Long-Term Repeatability. The GBI data more fully document the sizing variability expected within a single laboratory over time. Table 3 summarizes results for the seven GBI bands of size ≥10K bp. Figure 2 plots the results for all GBI quality assurance bands. These intralaboratory repeatability SD values exhibit nearly the same relationship to band size as do the interlaboratory reproducibility SDlab values. The slight difference in curvature between the repeatability and reproducibility SDs is attributable to laboratory-specific protocol modifications, as discussed in ref 3. The strong overlap of the long-term single-laboratory repeatability and interlaboratory reproducibility SD values confirms that much of SDlab is attributable to intralaboratory variability over time. Further, the overlap demonstrates that biases among laboratories are small enough that it may be useful to consider the SDlab data as if they were 20 sets from one lab, rather than one set from 20 labs.

Sizing Ladder Differences. Two commercial sizing ladders were used in the study. Figure 3 plots Student’s t results for paired differences in MBS between users of the two ladders. Only two of the 28 sample bands show statistically significant differences: D1:C:3 (a 1% difference in band sizes) and D1:C:2 (2.5%). Figure 4 shows the position of the sample bands in comparison to both ladders. The D1:C:3 band of MBS 4239 bp is in the region known to be affected by an anomalously moving band in one of the ladders.3 The D1:C:2 band of MBS 11 886 bp aligns with the 11 919 bp band of one of the ladders, while it is bracketed by the 10 086 and 15 004 bp bands of the other. The observed difference for this band results from the interaction of ladder differences and the use of two-band interpolation. Such extremely “local” interpolation algorithms are very sensitive to any nonlinearity in the underlying calibration. More global approaches that use additional bands above and below the two bracketing ladder bands may allow more accurate interpolation of the questioned band.8

Figure 3. Student’s t results for band size differences attributable to use of different sizing ladders. Symbols represent two-sample Student’s t test for mean difference of band size as determined from different calibration ladders, assuming equal variances. The dotted lines represent the critical 95% confidence two-tailed Student’s t value.

Predicting Interlaboratory Sizing Uncertainty. In ref 2, we described band sizing SD as
$$\mathrm{SD} = \sigma_R\, A\, (1 + bp/C)^{B} \tag{1}$$
where σR is one standard deviation of the measurement of the relative position of sample and sizing ladder bands, and A, B, and C are nonlinear least-squares regression estimates of the coefficients characteristic of sigmoidal calibration of band size to electrophoretic migration distance. The parameter estimates for an “average” set of demonstration data were found to be A = 1450, B = 2.9, and C = 3600. The measurement standard deviation, σR, was approximated as a constant minimum value, σRmin, plus a location-within-gel-dependent component: $\sigma_R = \sigma_{R\mathrm{min}} + \left[-0.0566 + 0.008 \log_{10}(bp) + 0.1/\log_{10}(bp)\right]$.

Figure 4. Relative location of D1:C:2 and D1:C:3 and sizing ladder component bands. The two ladders depicted were used by participants in the TWGDAM Large Fragment Study. The nominal band sizes of selected ladder components are displayed to the outside of each ladder.

(8) Elder, J. K.; Southern, E. M. Computer-Aided Analysis of One-Dimensional Restriction Fragment Gels. In Nucleic Acids and Protein Sequence Analysis; Bishop, M. J., Rawlings, C. J., Eds.; IRL Press: Oxford, 1987; pp 165-172.

The TWGDAM Large Fragment Study provides sufficient data, sufficiently well distributed in size, to permit direct parametrization of eq 1. Combining the constant component of σR into the A term and assuming that the functional dependence of σR on bp can be included in the B and/or C term curvature, the following transformation facilitates stable numerical estimation of the parameters:
$$\log_{10} \mathrm{SD} = A' + B' \log_{10}(1 + \mathrm{MBS}/C') \tag{2}$$
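To make the form of eq 2 concrete, the sketch below evaluates the relationship quoted in the abstract, SD = 7.5 (1 + MBS/19 500)^7.1, both directly and in the log-transformed form. It assumes that these abstract coefficients correspond to A′ = log10(7.5), B′ = 7.1, and C′ = 19 500; that mapping, and the function names, are illustrative assumptions rather than the authors' fitting code. A few (MBS, SD) pairs from Table 2 are printed alongside for comparison.

```python
import math

# Coefficients of the fitted relationship quoted in the abstract:
# SD = 7.5 * (1 + MBS / 19500) ** 7.1
A_FIT = 7.5
B_FIT = 7.1
C_FIT = 19500.0


def predicted_sd(mbs_bp, a=A_FIT, b=B_FIT, c=C_FIT):
    """Predicted interlaboratory sizing SD (bp) at mean band size mbs_bp."""
    return a * (1.0 + mbs_bp / c) ** b


def log10_predicted_sd(mbs_bp, a=A_FIT, b=B_FIT, c=C_FIT):
    """Same prediction in the eq 2 form, assuming A' = log10(a), B' = b, C' = c."""
    return math.log10(a) + b * math.log10(1.0 + mbs_bp / c)


# A few (MBS, SD) pairs copied from Table 2 for a side-by-side look.
table2_pairs = [(19814, 1100), (10040, 132), (4593, 39), (2193, 15)]
for mbs, observed in table2_pairs:
    print(f"MBS {mbs:>6} bp: observed SD {observed:>5}, "
          f"predicted SD {predicted_sd(mbs):7.1f}")
```

Working in log10(SD) keeps the very large SDs of the largest bands from dominating a least-squares fit, which is one plausible reason the transformation helps numerical stability.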
Using the 28 (MBS, SD) interlaboratory data of the Large Fragment Study in combination with GBI intralaboratory data of band size