Interlaboratory Comparison of Autoradiographic DNA Profiling

Anal. Chem. 1997, 69, 1882-1892

Interlaboratory Comparison of Autoradiographic DNA Profiling Measurements. 4. Protocol Effects

David L. Duewer, Lloyd A. Currie, and Dennis J. Reeder*
Chemical Science and Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland 20899

Stefan D. Leigh, James J. Filliben, and Hung-Kung Liu
Computing and Applied Mathematics Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland 20899

James L. Mudd
Forensic Science Research and Training Center, Laboratory Division, Federal Bureau of Investigation Academy, Quantico, Virginia 22135

The observed total interlaboratory uncertainty in restriction fragment length polymorphism (RFLP) measurements is sufficiently small to be of little significance given current forensic needs. However, as the number of RFLP data increases, further reduction in the total uncertainty could help minimize the resources required to evaluate potential profile matches. The large number of data available enables quantitative estimation of the within-laboratory imprecision and among-laboratory bias contributions to the total uncertainty. Some small but consistent among-laboratory measurement biases can be attributed to specific procedural or materials differences. The bias direction is often fragment-specific and thus unpredictable for unknown samples. Actions that would minimize currently recognized sources of interlaboratory bias include the following: (1) all laboratories should use the same algorithm for data interpolation, (2) all laboratories should use the same sizing ladders, (3) each laboratory should prepare control DNA and sample DNA in the same manner and with the identical reagents, (4) all laboratories should adopt a uniform policy on ethidium bromide use, and (5) all laboratories should adopt the same control DNA sizing acceptability criteria.

Given identical samples, ideal measurement systems will produce indistinguishable results regardless of when, where, or by whom the analysis is performed. Real chemical measurement systems seldom, if ever, attain this ideal. While replicate measurements made in a given laboratory often are normally distributed about a stationary average, different laboratories typically have somewhat different stationary averages. These among-laboratory differences (biases) can sometimes be associated with specific measurement protocol differences.1-4

(1) International Organization for Standardization. Precision of Test Methods - Determination of Repeatability and Reproducibility for a Standard Test by Inter-laboratory Tests; ISO 5725; ISO: Geneva, Switzerland, 1986.
(2) American Society for Testing and Materials. Standard Practice for Conducting an Interlaboratory Study to Determine the Precision of a Test Method; ASTM E 691-87; ASTM: Philadelphia, PA, 1987.

1882 Analytical Chemistry, Vol. 69, No. 10, May 15, 1997

In part 1 of this series,5 we demonstrated that the restriction fragment length polymorphism (RFLP) procedures used by the members of the Technical Working Group for DNA Analysis Methods (TWGDAM) permit reliable measurement of the molecular size of DNA fragments (“bands”). The basic analytical procedure involves digestion of extracted DNA with the restriction endonuclease HaeIII, electrophoretic separation of the resulting bands within an agarose slab gel, Southern blot immobilization of the separated fragments, hybridization with radiolabeled specific sequences of DNA (probes), autoradiographic visualization of the labeled DNA fragments hybridized to their specific complement (loci), determination of the relative position of calibration (sizing ladder) and sample bands on the autoradiogram, and calculation of the apparent molecular weight (size) of the sample bands expressed as the number of base pairs (bp).6-8 For laboratories that appropriately monitor measurement performance (using control and reference materials, internal quality assurance programs, and external proficiency tests), this procedure generates a distribution of band sizings having a characteristic average (mean band size or MBS) and a standard deviation (SD) that is a simple function of band size.

In parts 2 and 3 of this series9,10 we demonstrated that the observed total interlaboratory band-sizing SD, SDtot, is well described as a few parts-per-thousand variability in the relative locations of sample and sizing ladder bands, propagated through a sigmoidal calibration function relating electrophoretic migration distance and band size. As SDtot includes both within-laboratory precision and among-laboratory bias components of measurement uncertainty, for a 14-cm electrophoretic gel the biases are manifest as submillimeter differences in relative band location. Such small differences among laboratories are of little significance for current forensic practice where “match windows” of ±2.5% of band size (1-3.5 mm of a 14-cm gel) are used to compare bands from different gels. Nonetheless, analysis of these small biases facilitates the continued improvement of RFLP measurement systems. As RFLP technology matures and individual laboratories seek to optimize their systems, the ability to control the factors giving rise to measurement differences will ensure that today’s data will be useful for the lifetime of this technology.

The ability to compare DNA profiling measurements efficiently across laboratories and over time is not an abstract concern. National capabilities for storing and comparing results from biological evidence recovered from crime scenes are being implemented, facilitating identification of serial felons across forensic jurisdictions.11

The TWGDAM studies that form the basis for this series of papers were not designed to reveal specific information about differences in procedures among the participating laboratories. However, the pattern of minor differences among the laboratories was sufficiently consistent and the quality of the data sufficiently high that a number of sources of RFLP measurement bias can be identified. In the following sections, we (1) describe the data and methods used in our analyses, (2) more fully document the overall magnitude of interlaboratory biases, and (3) describe the components of the analytical process that are responsible for the documented biases. The existence and magnitude of several of these protocol effects are confirmed through extended analysis of relevant historical data.

(3) International Organization for Standardization. Statistics - Vocabulary and Symbols; ISO 3534-1; ISO: Geneva, Switzerland, 1993.
(4) Hamaker, H. C. Repeatability and Reproducibility: Some Problems in Applied Statistics. In Design, Data & Analysis by Some Friends of Cuthbert Daniel; Mallows, C. L., Ed.; John Wiley & Sons: New York, 1987; pp 71-92.
(5) Mudd, J. L.; Baechtel, F. S.; Duewer, D. L.; Currie, L. A.; Reeder, D. J.; Liu, H.-K.; Leigh, S. D. Anal. Chem. 1994, 66, 3303-3317.
(6) Southern, E. J. Mol. Biol. 1975, 98, 503-517.
(7) Budowle, B.; Baechtel, F. S. Appl. Theor. Electrophor. 1990, 1, 181-187.
(8) Guidelines for a Quality Assurance (QA) Program for DNA Analysis. Crime Lab. Dig. 1991, 18, 44-75.
(9) Duewer, D. L.; Currie, L. A.; Reeder, D. J.; Leigh, S. D.; Liu, H.-K.; Mudd, J. L. Anal. Chem. 1995, 67, 1220-1231.
(10) Stolorow, A. M.; Duewer, D. L.; Reeder, D. J.; A. M.; Buel, E.; Herrin, G., Jr. Anal. Chem. 1996, 68, 1941-1947.

S0003-2700(96)01070-0 This article not subject to U.S. Copyright. Publ. 1997 Am. Chem. Soc.
METHODS AND MATERIALS

Two recent interlaboratory studies documented many aspects of the basic RFLP procedure used in TWGDAM laboratories. The TWGDAM phase 2 Precision Study was initiated in April 1990. This study provided 23 sets of band-size results at loci D2S44 and D17S79 for each laboratory’s K562 cell line control DNA, FBI-provided DNA extracts of a single lot of K562 DNA, and FBI-provided DNA extracts from source “JM”. An expanded phase 3 study, encompassing the complete RFLP process, was initiated in August 1990. It provided 22 sets of results at the above two loci for each laboratory’s K562 DNA and five sets of FBI-provided blood stain samples, one of which was from source JM. In both studies, several laboratories reported multianalyst data. All band sizings from these and various ancillary studies (including some data for loci D1S7, D4S139, and D10S28) are available in the supplementary data to part 1.5

Formal standards exist for the conduct of interlaboratory studies and for the analysis of interlaboratory data.1-4 Neither the phase 2 nor phase 3 study fully conformed to these standards: not all loci of forensic interest were probed, only a limited range of band sizes was examined, not all laboratories reported the same number of replicate analyses, and no clear distinction was made between replicate analyses by the same analyst or by different analysts. Nonetheless, exploratory statistical evaluation of the intra- and interlaboratory components of variance has proven possible.

(11) U.S. Congress, Office of Technology Assessment. Genetic Witness, Forensic Uses of DNA Tests; OTA-BA-438; U.S. Government Printing Office: Washington, DC, July 1990.

Table 1. Protocol Factors for the TWGDAM Phase 2 and Phase 3 Precision Studies

code  factor                              levels
C     K562 cell line source               multiple sources: a, b, c, d
L     sizing ladder source                multiple sources: a, b, c, d, e
P     sizing program source               multiple sources: a, b, c
Gl    length of agarose gel               continuous (cm)
Gw    width of agarose gel                continuous (cm)
Gt    thickness of agarose gel            continuous (mm)
EEO   electroendoosmality                 binary: l (low), m (medium)
B     buffer used in gel and tank         binary: a (tris-acetate), b (tris-borate) EDTA
EB    ethidium bromide in running buffer  binary: n (no), y (yes)
Vs    voltage at supply                   continuous (kV)
Vg    voltage across gel                  continuous (kV)
Et    electrophoresis time                continuous (hour)
M     membrane                            multiple types: a, b, c
Tt    transfer time                       continuous (hour)
Ts    transfer solvent composition        multiple compositions: a (0.4 M NaOH); b (0.4 M NaOH, 0.5 M NaCl); c (0.5 M NaOH, 0.5 M NaCl); d (SSC 20×)

Systematic Procedural Differences. Even though examination of procedural differences was not an intended aspect of the early TWGDAM studies, considerable procedural information was collected from each laboratory. Table 1 lists the protocol differences (factors) for which information is available. Table 2 presents the evaluated settings (levels) for these factors for each laboratory in the phase 2 and 3 studies. About half of the participating laboratories modified one or more factor levels between studies. Exploratory Analysis. While we present some results from classical statistical techniques including formal analysis of variance (ANOVA), we primarily use exploratory graphical methods. The classical approaches are model-driven, employ assumptions, and can be rigorously quantitative. The graphical approaches are data driven, typically involve few assumptions, and tend to provide robust and readily intelligible, if qualitative, results. Use of exploratory graphics is explicitly encouraged in the most recent ASTM and ISO standards.1,2 One analysis tool is the boxplot, a device used for comparing sets of numbers by graphically displaying characteristics of the within-set distributions.12-14 The boxplot can be thought of as a visual one-way ANOVA or multiple comparisons test. Two general requirements must be met to assess properly whether different levels of a factor have differing influences on a measured response: (1) the factor must not be confounded with other factors and (2) there must be approximate balance between the number of data in compared sets. Four general effects can be revealed with boxplots: (1) differences in distribution location (box position), (2) differences in distribution dispersion (box size), (3) distribution asymmetry, and (4) presence of outlier values. 
The first two of these effects are of primary interest to this study: the median (denoted by a line within each box) is a robust estimate of set location; the interquartile range (IQR, the distance between the top and bottom of the box; i.e., the central 50% of the data) is a robust estimate of within-set dispersion. (12) Tukey, J. W. Exploratory Data Analysis; Addison-Wesley: Reading, MA, 1977. (13) Hoaglin, D. C.; Mosteller, F; Tukey, J. W. Fundamentals of Exploratory Analysis of Variance; John Wiley & Sons: New York, 1991. (14) Tufte, E. R. The Visual Display of Quantitative Information; Graphics Press: Cheshire, CT 06410, 1983.
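As a minimal sketch of the two boxplot statistics emphasized above, the following Python computes the median and IQR for two sets of sizings. The band sizes are invented for illustration and are not TWGDAM data; the function name `boxplot_summary` and the group names are likewise hypothetical.

```python
import statistics

def boxplot_summary(values):
    """Return the quantities a boxplot displays for one set of numbers."""
    # Quartiles by linear interpolation that includes both end points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,   # robust estimate of set location
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,     # robust estimate of within-set dispersion
    }

# Hypothetical band sizings (bp) for one band under two levels of one factor.
no_eb = [1780, 1785, 1790, 1795, 1800]
with_eb = [1790, 1800, 1815, 1830, 1840]

for label, data in (("no EB", no_eb), ("EB", with_eb)):
    s = boxplot_summary(data)
    print(f"{label}: median = {s['median']:.0f} bp, IQR = {s['iqr']:.0f} bp")
```

Comparing the two medians corresponds to a difference in box position; comparing the two IQRs corresponds to a difference in box size.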


Table 2. Levels for Protocol Factors(a) for All Laboratories in the TWGDAM Phase 2 and 3 Studies(b)

lab   C     L     P  Gl      EEO   B  EB    Et      M     Tt    Ts
A     a     d; b  a  16; 14  m     a  y     17; 16  c     6     a
B     c     d     b  16      m; l  a  y; n  17; 16  b     6     a
C     c     d     b  15      l     a  n     17      c     6     a
D     c     d     b  16      m     a  y     16      c     6     a
E     d; b  a     b  14      l     b  n     16      b     16    d
F     d     d     b  20      l     a  n     17      b     6     a
G     a     d     a  16      l     a  n     18      a     4     c
H     c     d     b  25      l     a  n     19; 17  b     6     a
I     d     d     a  16      l     a  y     17      b     6     a
J     d     d     b  16      l     b  n     17      c     6     a
K     d     d     b  15; 16  m     a  y     17      b     6     a
L     d     d     b  16      m     a  n     17      c; b  6     a
M     c     d     b  28; 20  l     a  n     17      b     6+    b
N     c     c     b  16      m     a  y     16      c     6     a
O     d     e; b  b  14      m     a  y     16      c; b  6     a
P     d     b     a  20      l     a  n     16      b     6     a
Q     d     d     b  22      l     a  y     16      c     6     a
R     d     d     b  14; 16  m; l  a  y; n  17      c; b  6     a
S(c)  c     d     b  15      m     a  y     17      b     6+    a
T     d     d; b  b  20      l     a  n     16      b     7; 5  a; c
V     c     d     b  15      l     a  n     19      b     4     a
X     c     d     c  16      m     a  n     17      b     15    a
Z     d     c     b  14      l     a  y     16      c     6     a

Columns Gw, Gt, Vs, and Vg contain blank entries for some laboratories, so their recovered values cannot be assigned to individual rows. The values are listed here in table order, with " / " marking breaks in the recovered sequence:

Gw (cm): 11 11 12 11 20 / 11 15 11 20 11 11 / 11 11 15 11 12 17 11 11
Gt (mm): 7 7 5 7 5 / 6 8 7 / 7 7 5 7 6 6 6 5 8; 7 6 7
Vs and Vg: 36 30; 28 30 32 33 35 22 32; 40 25 55 / 14 15 15 / 28 28; 23 32 32 32 40 30; 28 15 31 30 32 30 / 16 17 26 12

a See Table 1 for definition of the factors and their levels. b When changes were made between phases, the levels are listed: (phase 2); (phase 3). c Did not participate in phase 3 study.
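The EEO and EB columns of Table 2 can be cross-tabulated to see how the two factors co-vary across laboratories. The sketch below is illustrative only: it assumes that the 23 EEO entries and 23 EB entries are listed in laboratory order A through Z, and the helper name `phase3` is hypothetical.

```python
from collections import Counter

# Assumption: entries are in laboratory order; "x;y" means the level
# changed from x (phase 2) to y (phase 3), as in Table 2 footnote b.
LABS = "A B C D E F G H I J K L M N O P Q R S T V X Z".split()
EEO = "m m;l l m l l l l l l m m l m m l l m;l m l l m l".split()
EB = "y y;n n y n n n n y n y n n y y n y y;n y n n n y".split()

def phase3(level):
    """Return the phase 3 level of a Table 2 entry."""
    return level.split(";")[-1]

# Count laboratories for each (EEO, EB) combination in phase 3.
counts = Counter(
    (phase3(eeo), phase3(eb))
    for lab, eeo, eb in zip(LABS, EEO, EB)
    if lab != "S"  # laboratory S did not participate in phase 3
)

for (eeo, eb), n in sorted(counts.items()):
    print(f"EEO = {eeo}, EB = {eb}: {n} laboratories")
```

A strongly unequal spread of laboratories across the four cells is what limits a separate evaluation of the two factors with boxplots.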

Figure 1. Representative intralaboratory bivariate distributions for K562 sizings for two laboratories. Each scattergram presents the distribution of K562 sizings (band1 vs band2) at one genetic locus for two forensic laboratories, labeled “a” and “o”, respectively. The genetic locus for each scattergram is specified in the lower right-hand corner; the number of data in each distribution is given in the upper left-hand corner. The center of each distribution is marked with a labeled circle. The ellipses denote the data-defined 95%/95% bivariate-normal tolerance intervals for the two distributions. The scattergram axes each span 96%-104% of the NIST certified values for the given band of the given K562 locus; the dashed lines span 97.5%-102.5% of the NIST certified values.

Confirmatory Analysis. We have evaluated quantitatively several of the statistically identified protocol effects through new experiments and detailed analysis of ancillary data. The methods used for each evaluation are documented in their appropriate sections. RESULTS AND DISCUSSION Figure 1 displays the apparently random within-laboratory (imprecision) and fixed between-laboratory (bias) components of RFLP measurement uncertainty. Each of the four scattergrams details K562 cell line control data at one DNA locus for two “most different” laboratories.5 Each scattergram plots the size of the larger K562 allele (band1) vs the smaller (band2); each axis spans 96%-104% of the NIST certified values for the given band of the given K562 locus.15 The approximately bivariate normal distributions are emphasized by 95%/95% tolerance ellipses that bound 95% of each laboratory’s data with 95% confidence.16 Each ellipse’s size indicates measurement imprecision (rather, precision) of each measurement pair; the difference in ellipse centers indicates measurement bias (rather, concordance). Highly replicated data such as those displayed in Figure 1 are seldom available with sufficiently detailed factor level information to permit characterization of protocol effects. Figure 2 shows the distribution of the TWGDAM phase 2 and 3 K562 data and 15 individual laboratory averages from the ancillary data described in part 1.5 The TWGDAM data, while derived from at most a few electrophoretic gels per laboratory, have about the same location and dispersion as do the laboratory averages. Figure 3 shows the distribution of laboratory averages for loci D1S7, D4S139, and D10S28. While the phase 2 and 3 studies examined only loci D2S44 and D17S79, the laboratory biases for all loci have roughly the same relative magnitude. The protocol effects documented in the TWGDAM phase 2 and 3 data thus appear characteristic of those encountered in forensic practice. 
It is important to distinguish between statistical and analytical significance. Many of the among-laboratory biases shown in Figures 1-3 are statistically significant, using either visual or formal evaluations; i.e., the null hypothesis that there are no measurable differences in the sizing data produced in different laboratories can be rejected with reasonable confidence. As shown in Figures 2 and 3, the observed range of band-size biases among the 15 laboratories results from very small migration-distance differences. These biases fit within the range of sizes expected from a single laboratory. Because the DNA profile matching techniques currently used assume measurement variation severalfold larger than this range, these biases are of little or no practical importance.17,18 However, for any given match criterion, the number of potential matches that must be further evaluated will increase as the number of profiles available for comparison increases. Should it become necessary to increase the stringency of the matching techniques as the number of data increases, these same biases may well become limiting. We quantify the current magnitude of intra- and interlaboratory measurement variation in the following sections.

(15) National Institute of Standards and Technology. Certificate of Analysis, Standard Reference Material 2390, DNA Profiling Standard; Standard Reference Materials Program, NIST: Gaithersburg, MD 20899, 1992.
(16) Hall, I. J.; Sheldon, D. D. J. Qual. Technol. 1979, 11, 13-19.
(17) Budowle, B.; Giusti, A. M.; Waye, J. S.; Baechtel, F. S.; Fourney, R. M.; Adams, D. E.; Presley, L. A.; Deadman, H. A.; Monson, K. L. Am. J. Hum. Genet. 1991, 48, 841-855.
(18) Roeder, K. Stat. Sci. 1994, 9, 222-278.

Figure 2. Laboratory averages for K562 sizings from TWGDAM phase 2, phase 3, and population studies for loci D2S44 and D17S79. Each scattergram presents laboratory averages of the TWGDAM phase 2 and phase 3 data, labeled “2” and “3”, respectively. The bivariate distribution centerpoints for data from the 15 laboratories that contributed data to the TWGDAM population study (supplementary material, part 1 of this series5) are marked with circles labeled “a”-“o”. Each scattergram axis is scaled as in Figure 1. A representative image of a commercial sizing ladder is shown with assigned band sizes, with lines connecting the electrophoretic migration distance of the ladder with calculated band sizings on the scattergrams. The dark outer lines bracket ±2.5% of the NIST certified value for the given band of the given locus; the light inner lines mark the observed range of the 15 laboratory distribution centerpoints. Sizing ladder image courtesy of Life Technologies, Inc. (Gaithersburg, MD).

Figure 3. Laboratory averages for K562 sizings from TWGDAM population studies for loci D1S7, D4S139, and D10S28. Scattergrams are presented as in Figure 2.

ANOVA Demonstration of Among-Laboratory Bias. Table 3 presents results of an unbalanced-design, linear-model ANOVA of the TWGDAM phase 3 band sizing.19 Results are provided for the band-size response variable as reported base pairs and as reduced migration distance (R, effectively the migration distance of the unknown DNA fragment divided by the migration distance of the smallest sizing ladder component). While not used by TWGDAM laboratories, the statistical properties of sizing results reported in R may be more readily related to experimental effects than those reported in base pairs.9,20 Three known variance sources designed into the study are (1) genetic locus (“G”; D2S44 or D17S79), (2) DNA donor (“D”; K562, JA, JM, KH, or SC), and (3) band (“B”; high or low). Three potential variance sources are (1) sample replicate (“S”; a unique code assigned to each sample from each DNA donor), (2) laboratory (“L”; a unique code for each participant), and (3) analyst (“A”; a unique code for each set of data from a given participant). The following nested-variable model was designed from our prior knowledge of the RFLP system to include all known relationships among the design variables in a minimum number of effects:

$$\text{band size} = \beta_I + \beta_A L + \beta_B A(L) + \beta_C S(D) + \beta_D D + \beta_E G(D) + \beta_F B(D{*}G) + \text{error}$$

where $\beta_I$ is the intercept, the $\beta_i$’s are slopes for each effect, L is the laboratory main effect, A(L) is a nested within-laboratory analyst interaction, S(D) is a nested within-donor replicate sample interaction, D is the DNA donor main effect, G(D) is a nested within-donor genetic locus interaction, B(D*G) is a nested within-donor allelic band interaction for each locus, and error is the residual variance for which the model does not account. This model adequately describes the variance observed in the base-pair and in the R-transformed sizing data, with squared correlation for both responses of 0.9995 and no apparent structure in the residuals. The effects of the known variance sources are very strong, as indicated by the small “Pr > F” probabilities. There is no systematic analyst effect (in Table 3, rms for A(L) is less than the error rms), a marginally significant sample effect (rms for S(D) is less than twice the error rms), but a clearly significant laboratory effect (rms for L is ∼8 times the error rms). The sample effect may be attributable to the fixed order of sample and sizing ladders specified in the study design. The residual error terms estimate the imprecision component of SDtot in a “typical” laboratory as 12 bp and 0.13R, respectively, for the two models. These values are in excellent accord with the 17-bp SD predicted by eq 7 of part 1 for bands of size 1852 bp (the mean of the band sizes examined) and the observed 0.12R SD in the relative position of sample and sizing ladder bands documented in Figure 3 of part 2.5,9

(19) SAS Institute Inc. The GLM Procedure. In SAS User’s Guide: Statistics, Version 5 ed.; SAS Institute Inc.: Cary, NC, 1985; Chapter 20.
(20) Eriksen, B.; Svensmark, O. Forensic Sci. Int. 1993, 61, 21-34.

Quantification of Imprecision and Bias. Given that there is a statistically significant systematic laboratory effect, what is its magnitude? In part 3 of this series,10 we estimate SDtot for bands of size 1000-20 000 bp as

$$SD_{tot} = 7.5\,(1 + MBS/19500)^{7.1}$$
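To give a feel for the magnitudes involved, the SDtot relationship from part 3 can be evaluated at a few band sizes and set beside the ±2.5% match window mentioned in the introduction. This is only an illustrative sketch: the function name `sd_tot`, the constant `MATCH_WINDOW`, and the chosen band sizes are assumptions made here, and the relationship is stated only for bands of 1000-20 000 bp.

```python
def sd_tot(mbs_bp):
    """Total interlaboratory sizing SD (bp) for a mean band size in bp."""
    return 7.5 * (1 + mbs_bp / 19500) ** 7.1

MATCH_WINDOW = 0.025  # +/-2.5% of band size

for mbs in (1000, 1852, 5000, 10000, 20000):
    sd = sd_tot(mbs)
    half_window = MATCH_WINDOW * mbs
    # Express the half-width of the match window in units of SDtot.
    print(f"MBS {mbs:>6} bp: SDtot = {sd:7.1f} bp "
          f"({100 * sd / mbs:.2f}% of band size), "
          f"half-window = {half_window / sd:.1f} SDtot")
```

The ratio printed in the last column indicates how many total standard deviations fit inside half of the match window at each band size.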

Pooling the individual laboratory sizing SDs for the 10 K562 fragments from the 15 laboratories described in the ancillary data to part 15 provides an estimate of within-laboratory imprecision, SDimp. The SD among the mean values for the 15 laboratories (these means are shown in Figures 2 and 3) is a direct estimate of the expected among-laboratory bias, SDbias. These estimates are presented in Figure 4. Imprecision and bias contribute about equally to SDtot. Note that SDbias is larger than expected for both K562 alleles at locus D1S7. Examination of Specific Protocol Effects. In this section, we examine the laboratory biases observed in the data. The protocol variations among laboratories in the TWGDAM phase 2 and 3 studies are sufficient in number and distribution to enable exploratory analysis. Not all the variation can be examined; differences unique to individual laboratories and differences that are highly correlated across all laboratories may be confounded in the analysis. All the factors listed in Table 1 are sufficiently replicated and orthogonal to support statistical discrimination. Boxplots were evaluated for all statistically testable effects for all samples at both loci. Example boxplot analyses are given in

Table 3. Analysis of Variance for the TWGDAM Phase 3 Precision Study

design variables                  code  no. of levels  level names
laboratory                        L     22             A-R, T, V, X, Z
analyst of given laboratory       A     6              1-6
DNA donor                         D     5              K562, JA, JM, KH, SC
sample replicate for given donor  S     10             Control, Q1-Q9
genetic locus                     G     2              D2S44, D17S79
band at given genetic locus       B     2              1, 2

band size response variables  code  min     mean    max     no. available  no. missing
base pairs                    bp    1255    1852    3173    1069           185
relative migration distance   R     0.0442  0.1113  0.2542  1069           185

Model:(b)

$$\text{band size} = \beta_I + \beta_A L + \beta_B A(L) + \beta_C S(D) + \beta_D D + \beta_E G(D) + \beta_F B(D{*}G) + \text{error}$$

                band size as bp                  band size as R
source  df      SS         SD    Pr > F          SS       SD      Pr > F
G(D)    5       133000000  5157  0.0001          1.623    0.5698  0.0001
D       4       90250000   4383  0.0001          1.060    0.4858  0.0001
B(D*G)  9       76840000   3167  0.0001          0.944    0.3432  0.0001
L       21      182800     93    0.0001          0.0022   0.0104  0.0001
S(D)    5       2213       21    0.012           0.00002  0.0023  0.011
A(L)    11      937        9     0.86            0.00001  0.0010  0.83
model   55      31764000   2403  0.0001          3.8427   0.2643  0.0001
error   1013    151700     12                    0.0018   0.0013

a See refs 13 and 19 for details of ANOVA terminology, principles, and procedures. b where $R = -0.056 + (1.31/(1 + 3600/\text{bp}))^{1.9}$, an approximate relationship described in ref 9; $\beta_I$ is the model intercept; $\beta_i$ is the slope for a model effect; main effects are specified by the variable code itself; balanced interactions are specified by joining the variables with an asterisk; nested interactions are specified with parentheses; df is the number of degrees of freedom for each variation source; SS is the sum of squares attributable to the variation source (“type III” for the effects); SD is the root-mean-square estimate of standard deviation attributable to the variation source; Pr > F is the significance probability that the true slope for the model parameter is zero.
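The rms comparisons quoted in the text for the laboratory, sample, and analyst effects can be reproduced from the df and SS entries of Table 3. The sketch below uses the base-pair values for those three effects and the error term, taking SD as the square root of SS/df as defined in footnote b; the names `ROWS`, `rms`, and `ratio` are illustrative assumptions.

```python
import math

# (df, SS) for band size in base pairs, as listed in Table 3.
ROWS = {
    "L": (21, 182800),
    "S(D)": (5, 2213),
    "A(L)": (11, 937),
    "error": (1013, 151700),
}

def rms(df, ss):
    """Root-mean-square SD estimate: sqrt(sum of squares / degrees of freedom)."""
    return math.sqrt(ss / df)

sd = {name: rms(df, ss) for name, (df, ss) in ROWS.items()}
# Ratio of each effect's rms to the residual (error) rms.
ratio = {name: value / sd["error"] for name, value in sd.items()}

for name in ("L", "S(D)", "A(L)"):
    print(f"{name:5s} rms = {sd[name]:5.1f} bp, "
          f"{ratio[name]:.2f} x error rms ({sd['error']:.1f} bp)")
```

Ratios below 1 correspond to the "no systematic effect" reading in the text, ratios between 1 and 2 to "marginally significant", and the much larger laboratory ratio to a clearly significant effect.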

Figure 5 for four different effects on the phase 3 data for the JM samples at the D2S44 locus.

Figure 4. Imprecision and bias components of total measurement uncertainty. The solid line displays the expected total uncertainty, SDtot, using the relationship determined in part 3.10 The filled circles denote the expected within-laboratory imprecision component, SDimp, estimated by pooling the 15 individual laboratory sizing SDs for 10 K562 bands (see part 1 supplementary material5). The open circles denote the bias component for the same data, SDbias, estimated as the SD of the 15 individual laboratory means.

Figure 5. Representative boxplot analysis. Each subplot contrasts band-size distributions for one band of the locus D2S44 JM samples analyzed in the TWGDAM phase 3 study. Four sets of main effect contrasts are shown: “no use” vs “use” of EB in the running buffer, “low” vs “medium” agarose, “type 1” vs “type 2” sizing ladder, and “small” vs “large” gel volume. The largest reported value of each distribution is indicated by the top of the vertical line, the third quartile (75%) by the top of the central box, the median by the central horizontal line, the first quartile (25%) by the bottom of the central box, and the smallest reported value by the bottom of the vertical line. The distance between the top and bottom of the central box (the interquartile range) is a robust estimate of the distribution’s dispersion; the median is a robust estimate of location. The width of the central box indicates the relative number of data described.

Ethidium Bromide. The fluorescent dye ethidium bromide (EB) is commonly used as a means of visualizing DNA fragments in electrophoretic gels. Early forensic practice made extensive use of this intercalating agent to monitor visually the electrophoresis of analytical gels. Most laboratories now minimize EB usage, reflecting concerns for analyst health and disposal safety. In the phase 2 and 3 studies, both location and dispersion effects attributable to the use of EB in the running buffer are easily discernible. As seen in Figure 5, the use of EB in the running buffer may cause changes in location: the median size of one band is significantly increased when EB is used. Both increases and decreases in measured fragment size of up to 2% of band size were observed with other samples. This effect is a reasonable and anticipated result. EB intercalation into both ladder and sample DNA fragments is a function of fragment composition, fragment sequence, and relative concentration of EB as well as of the total fragment size.21 Thus, the sign of the effect for uncharacterized samples is intrinsically unpredictable as it depends upon the relative amount of EB intercalating into given sizing ladder and sample DNA fragments. Figure 5 also shows a significant dispersion effect: the IQR is typically larger (i.e., band sizing is more variable) when EB is used in the running buffer. This may reflect interlaboratory differences in the concentration of EB in the buffer.

Electroendoosmality. Agar is a naturally occurring polymeric mixture of widely varying composition. The refined agar (agarose) used in gel electrophoresis varies according to source and processing technology. Many agarose types with a variety of empirically defined properties are commercially available. One of the major properties, the electroendoosmotic (EEO) character, reflects complex ionic differences that strongly influence biopolymer migration through electrophoretic gels.22 Medium-EEO agarose was originally suggested for forensic use. Most forensic laboratories now use low-EEO agarose, at least in part due to the widespread use of this material in molecular biology and its consequent commercial availability. As seen in Figure 5, laboratories using medium-EEO agarose consistently report larger band sizes than those using low-EEO agarose. This increased band-size effect is consistent for all bands of all samples. The magnitude of the effect appears to be fragment-specific, with increases on the order of 0.8%-3% of band size seen in the TWGDAM phase 3 data.

Combined EB and EEO Effect. Because of confounding between EB and EEO (nearly all laboratories that use medium-EEO agarose also use EB, those that use low-EEO agarose tend not to use EB), a full analysis of these two effects is not currently possible. However, the two effects do appear to interact. Figure 6 documents the results of changing from EB in the running buffer and medium-EEO agarose to no EB use and low-EEO agarose in one laboratory. Note that five of the eight alleles displayed in the scattergrams appear to be larger under the (EB, medium EEO) conditions than with (no EB, low EEO). However, the smaller allele of D4S139 is not substantively affected and both alleles of D1S7 appear smaller under the (EB, medium EEO) conditions.

K562 Cell Line Supplier and Form. Several immortalized cell lines have been used to help ensure forensic RFLP data quality. K562, a female cell line maintained by the American Type Culture Collection (Rockville, MD), has become a de facto standard.23 Two different commercial suppliers of the K562 standard cell line were used in this study. Boxplot analysis (not shown) indicated band-specific differences of 0.5%-1.0% between these two sources; as there is no a priori reason to suspect differences in size or composition of HaeIII-digested fragments of K562 DNA, these differences were not anticipated. Once the K562-related bias was identified, we realized that the supplier of K562 DNA is insufficient to characterize fully the diversity of the material. At least three forms of K562 are commercially available from one or more suppliers: intact cell pellets (“cellular”), extracted genomic DNA (“genomic”), and HaeIII restriction digest mixed with loading buffer (“precut”). Figure 7 displays the sizing biases between precut K562 and within-laboratory extracted/digested cells observed by one laboratory over a period of three years. For this laboratory, the precut K562 sizings average a remarkably constant 0.6% higher than those of the cellular material.

(21) Waye, J. S.; Fourney, R. M. Appl. Theor. Electrophor. 1990, 1, 193-196.
(22) Allen, R. C.; Budowle, B. Gel Electrophoresis of Proteins and Nucleic Acids: Selected Techniques; Walter de Gruyter: Berlin, 1994.
(23) Federal Bureau of Investigation. Standards for CODIS Acceptance of DNA RFLP Data at NDIS, Draft 20-May-1996; U.S. Department of Justice, Washington, DC 20535, 1996; pp 4-7.


Analytical Chemistry, Vol. 69, No. 10, May 15, 1997

Reevaluation of interlaboratory data obtained during the certification of NIST DNA Profiling SRM 2390,15 however, does not support the conclusion that the observed differences are explained by the form of the K562 DNA alone. While distinct sizing populations had been documented for each of the three DNA forms provided in the NIST SRM, the magnitude and pattern of bias among the forms appear laboratory-specific. While further research is clearly necessary, we believe all observed K562 source and form differences may be manifestations of systematic differences in concentration and composition of the sample loading buffers used with the various K562 forms. Electrophoretic separation of DNA is very sensitive to the exact composition and concentration of the electrolyte buffer used.22 We expect that very minor laboratory-specific differences in electrolyte composition (and/or the “inerts” used to stabilize the precut material) may arise in the detailed handling of commercial precut K562 DNA relative to within-laboratory extracted samples. The magnitude and direction of the resulting bias would thus be consistent within a laboratory, yet unpredictable across laboratories.

Sizing Ladder. Internal calibration to DNA fragments of known sequence permits conversion of fragment migration distance to band size.9 Two related ladders from one commercial source were used by all but one laboratory for the phase 2 study. By the completion of the phase 3 study, four laboratories had converted to using ladders from other commercial sources. There is some evidence in the phase 3 data of systematic differences between ladders from the two sources, with the magnitude and sign of the bias being band-size-specific.

Figure 6. Observed intralaboratory differences associated with shift from (EB use, medium-EEO agarose) to (no EB, low-EEO agarose). Each scattergram presents K562 sizings from casework within one laboratory before and after an intentional protocol modification. The “+” denote sizings from medium electroendoosmotic (ME) agarose gels run with EB in the running buffer. The “-” denote sizings from low electroendoosmotic (LE) agarose gels run without EB. Centering, scale, labeled circles, and ellipses are as in Figure 1. The solid diamonds in D1S7, D2S44, and D4S139 scattergrams represent sizings from intentionally overexposed autoradiograms derived from one gel. Data courtesy of Eric Buel, Vermont Department of Public Safety.
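A before/after comparison such as the one in Figure 6 can be summarized as a percent shift between robust location estimates. The sketch below is an illustration only: `percent_bias` is a hypothetical helper, the use of medians is an assumption made for consistency with the boxplot analysis, and the sizings are invented.

```python
from statistics import median


def percent_bias(test_sizes, reference_sizes):
    """Percent difference of the test median relative to the reference median."""
    ref = median(reference_sizes)
    return 100.0 * (median(test_sizes) - ref) / ref


# Invented K562 sizings (bp) for one band under two protocols.
with_eb = [2915, 2920, 2925]     # e.g. EB in running buffer, medium-EEO agarose
without_eb = [2895, 2900, 2905]  # e.g. no EB, low-EEO agarose

shift = percent_bias(with_eb, without_eb)
print(f"shift = {shift:.2f}%")
```

With these invented values the medians differ by 20 bp on 2900 bp, a shift of roughly 0.7%. A positive value means the test condition sizes larger than the reference; the sign must be evaluated band by band, since the text notes that the direction of bias is fragment-specific.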

The two most commonly used RFLP sizing ladders have different hybridization requirements, making direct comparison using standard protocols difficult. However, the ladders may easily be compared indirectly through gel-specific empirical calibration functions of the form

$$\tilde{R}_i = \beta_1 + \frac{\beta_2}{(1 + bp_i/\beta_3)^{\beta_4}} + \beta_5 \ln(bp_i/bp_m)^4 + \mathrm{error}$$

where $\tilde{R}_i$ is the measured migration distance of the ith ladder component, normalized so that the ladder origin is at zero and the total length is 1; $bp_i$ is the nominal band size of the component; and $bp_m$ is the median component size. The basic sigmoidal form of this function is discussed in part 2;9 the additional logarithmic term accounts for residual structure. The “apparent” size of each ladder component (i.e., the size in best electrophoretic agreement with neighboring components) can thus be calculated. Replacement of the nominal size of ladder bands with these apparent sizes permits identification of local biases relative to a global calibration model. Figure 8 demonstrates the expected bias as a function of MBS for two widely used sizing ladders. Given the uncertainties of the indirect comparison, we expect little sizing bias over most of the studied 1000-10 000 bp range. There is, however, strong evidence for a positive bias of more than 1% in the region of 4300 bp. From visual examination of global calibration curves for many

different autoradiograms, this bias is caused by the anomalous electrophoretic mobility of one particular ladder component relative to its nominal sequence-assigned size. Both K562 bands at locus D1S7, having NIST certified sizes of 4571 and 4231 bp,15 should be influenced by the predicted bias between sizing ladders. Figure 9 documents K562 band-size distributions observed within one laboratory while using the two ladders. Except for the locus D1S7 bands, all distributions are similar. Both D1S7 bands show the expected 1% bias. This also accounts for the excess bias component for the D1S7 bands observed in Figure 4.

Sizing Software and Interpolation Algorithm. While most participants in the TWGDAM studies utilized FBI-supplied imaging software,24 four laboratories provided results obtained using one commercial supplier’s software. Exploratory analysis indicated that there is no significant sizing bias between these two software sources. This was confirmed by the absence of significant between-software differences in the 1991 TWGDAM phase 1B image analysis comparison study, where four autoradiograms were independently analyzed in 21 different laboratories.5 A related potential problem was “brought to our attention” during the course of NIST analysis of the TWGDAM Large Fragment study autoradiograms.10 Local-linear and local-logarithmic interpolation algorithms provide band-size estimates that differ by 2% or more for some band sizes larger than 10 000 bp. While apparently all forensic laboratories use the same basic local-logarithmic algorithm, for at least one image analysis system it is relatively easy to select the “wrong” method. Further, the routine output provided by that system does not state which interpolation method was used.

(24) Monson, K. L. Crime Lab. Dig. 1988, 15, 104-105.

Figure 7. Observed intralaboratory differences associated with use of two different forms of K562 DNA, genomic and commercially precut. Each scattergram presents K562 sizings by one laboratory that periodically evaluates commercially available materials. Laboratory-cut genomic K562 DNA data are labeled “p”; commercially precut K562 DNA are labeled “q”. Centering, scale, labeled circles, and ellipses are as in Figure 1. Data courtesy of Christine Tomsey and Beth Giles, Pennsylvania State Police.

Figure 8. Expected sizing differences associated with use of two commercially available sizing ladders. The dark line displays the percent bias expected for sample bands of size 800-10 000 bp if sizing was performed with either of two commercial sizing ladders, labeled “r” and “s”, respectively. Shaded regions represent ±1 SD about the expected bias. Labeled circles mark the bias expected for K562 bands at loci: D1S7 (“1”), D2S44 (“2”), D4S139 (“4”), D10S28 (“10”), and D17S79 (“17”).

Gel Dimensions and Applied Voltage. No clear effects were observed for any of the simple dimensional variables of gel length, width, or thickness. This does suggest that the enhanced resolution often pursued with use of a longer gel format has little impact on band sizing. However, there is an effect related to gel area (length × width) and even more strongly to gel volume (area × thickness). Band sizes for “large” area or volume gels average 1% larger than those in “small” area or volume gels. As the equipment and materials used for agarose gel electrophoresis within each laboratory are held constant after initial characterization, most TWGDAM participants monitor the output voltage of the electrophoretic power supply rather than voltage gradient across each gel. While output voltage is adequate for intralaboratory method control and diagnostic procedures, the lack of correlation between output voltage and the across-gel gradient limits any detailed analysis. However, some specific bands are up to 1% smaller when the highest reported applied voltages are contrasted with the lowest. The gel area, gel volume, and applied voltage biases may arise from temperature-dependent band diffusion properties. Such temperature effects may arise from differential Joule heating, heterogeneous buffer circulation, variation in the depth of running buffer above the

Figure 9. Observed intralaboratory differences associated with sizing ladder change. Each scattergram presents K562 sizings within one laboratory that changed from sizing ladder “r” to sizing ladder “s”. Centering, scale, labeled circles, and ellipses are as in Figure 1. Data courtesy of Renee Romero and Berch Henry, Washoe County, Nevada, Sheriff’s Office.

surface of the gel, changes in laboratory temperature, etc. Further analysis is needed to explore these possibilities.

Other Factors. The TWGDAM data provide no evidence for band-sizing biases attributable to the gross composition of the running buffer, the source of immobilizing membrane, duration of electrophoresis, duration of transfer, composition of the transfer solutions, or restriction enzyme source. (An anonymous manuscript reviewer indicated that the source and the lot of restriction enzyme have given rise to sizing biases.) One datum shown in Figure 6 suggests that optical density of sample bands may give rise to unexpectedly complex sizing artifacts. The data represented as solid diamonds on the locus D1S7, D2S44, and D4S139 scattergrams are from consciously

overexposed autoradiograms for one gel. Overexposure was required to confirm the presence of a very weak sample band. The sizes of the widely separated D2S44 and D4S139 K562 control bands are entirely normal; the sizes of the closely spaced D1S7 K562 bands are each unexceptional but do not show the expected correlation structure.

CONCLUSIONS AND RECOMMENDATIONS

Within-laboratory measurement imprecision and among-laboratory measurement biases (including those within a given laboratory after changing some aspect of the measurement protocol) contribute about equally to the total observed interlaboratory RFLP band sizing uncertainty. Minimizing biases among laboratories


is thus as important to the effective interjurisdictional exchange of RFLP data as is minimizing measurement variance within each laboratory. As the number and sizes of convicted-offender databases increase, the number of “matches” at any set “match window” must also increase. While the magnitude of current interlaboratory biases is of little significance for the databases and comparison criteria now in use, control and further minimization of bias will be required to ensure that the number of such matches remains tractable.

Inclusion of one or more DNA controls (such as K562) in every forensic RFLP measurement is essential for monitoring measurement performance. In addition to providing a mechanism for judging the acceptability of each unique set of RFLP measurements, analysis of the distributions of the accumulated control data permits identification of phenomena not readily apparent in closely related gels. These distributions do not change randomly over time; the changes that do occur are associated with known procedural or materials modifications. While such protocol changes are carefully designed and validated to ensure measurement comparability, measurement reproducibility is sufficiently high and the number of data generally sufficiently large to permit reliable identification of relative sizing changes of 0.1% or less. Our analysis indicates that any protocol change may result in such a small but identifiable bias. Further, the direction of the bias is often fragment-specific and thus unpredictable for any unknown sample.

While forensic analysts have the greatest influence on any given RFLP measurement through the innumerable sample handling, extraction, electrophoretic, image analysis, and data-recording steps, the consequences of missteps are generally obvious. Most of the biases we have observed result from intentional choices made at the laboratory supervisor level and/or financial constraints imposed upon the responsible agency. These effects include the following: use of ethidium bromide in the running buffer; the nominal electroendoosmotic characteristics of the agarose; choice of sizing ladder; source of restriction endonuclease; source and form of control materials; physical dimensions of the gel; characteristic electrostatic fields applied and temperature control and buffer recirculation techniques used; choice of imaging system hardware and software; and choice of interpolation algorithms when imaging software provides such options. However, some bias sources are not within a given laboratory’s control. The performance characteristics of sizing ladders, control materials, buffers, agarose, restriction enzyme, probe activity, and imaging software systems supplied to the forensic community usually are determined by the manufacturer. Individual laboratories need to confirm the performance of all materials and equipment prior to casework use; exchange of such quality assurance information among laboratories may reduce the individual effort required.

Neither exploratory nor confirmatory analysis can characterize information that is not built into the data. We suggest further


direct experimental characterization of (1) performance measures for agarose, including but not limited to EEO; (2) temperature measurement and control of the gel during electrophoresis; and (3) the effects of buffer recirculation.

Currently available data support the following suggestions for the control and further reduction of band-sizing biases among forensic laboratories:

(1) Band sizes based on excessively intense sample or ladder bands are known with less than typical certainty. All intergel comparisons using such bands should take this increased variability into account.
(2) Different interpolation algorithms (e.g., local linear and local logarithmic) can produce significantly different results, particularly for bands of size greater than 10 000 bp. All laboratories should use the same interpolation algorithm.
(3) Different sizing ladders can produce significantly different results. All laboratories should adopt the same sizing ladder.
(4) Even minor variations in loading buffer composition systematically affect apparent band sizes. Control samples should be prepared in the same manner and loaded on gels with the same reagents used for all other DNA samples.
(5) Inclusion of EB in the electrophoretic running buffer has a relatively strong, fragment-specific influence on band size. All laboratories should adopt the same policy toward EB use.
(6) Use of locally determined target values and acceptance ranges for cell line control band sizes perpetuates among-laboratory bias. All laboratories should use the same target values and acceptance ranges.

ACKNOWLEDGMENT

We again thank all the individuals and institutions who participate in TWGDAM-sponsored characterizations of forensic DNA-profiling technologies. Truly, without their cooperation, in both the data and the insights they so willingly share, our studies would not have been possible. We particularly wish to thank Eric Buel and the Vermont Department of Public Safety; Christine S. Tomsey, Beth Giles, and the Pennsylvania State Police; and Renee Romero, Berch Henry (now with the Alabama Department of Forensic Sciences), and the Washoe County, Nevada, Sheriff’s Office for their devoted experiments and exemplary data. Elizabeth A. Benzinger of the Ohio Bureau of Criminal Identification and Investigation, Ron Fourney of the Royal Canadian Mounted Police, John Hartmann of the Orange County, California, Sheriff-Coroner Department, and Katherine Sharpless of NIST provided us with exceptionally patient guidance and criticism. Life Technologies, Inc. (Gaithersburg, MD) kindly provided the sizing ladder image we use to connect numerical abstraction to experimental reality. This study was supported in part by the National Institute of Justice and the Federal Bureau of Investigation, U.S. Department of Justice.

Received for review October 16, 1996. Accepted March 6, 1997. AC961070K

Abstract published in Advance ACS Abstracts, April 15, 1997.
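
The recommendation on interpolation algorithms rests on the observation that local-linear and local-logarithmic interpolation can disagree by 2% or more for bands larger than 10 000 bp. The sketch below illustrates how such a difference can arise, under the assumption that both methods interpolate between the two ladder bands flanking the sample band, one linearly in band size and the other linearly in the logarithm of band size. These two-point definitions, the function names, and the ladder positions are illustrative assumptions, not the algorithms of any particular imaging system.

```python
import math


def local_linear(d, d_lo, bp_lo, d_hi, bp_hi):
    """Interpolate band size linearly in migration distance (assumed form)."""
    frac = (d - d_lo) / (d_hi - d_lo)
    return bp_lo + frac * (bp_hi - bp_lo)


def local_log(d, d_lo, bp_lo, d_hi, bp_hi):
    """Interpolate log(band size) linearly in migration distance (assumed form)."""
    frac = (d - d_lo) / (d_hi - d_lo)
    return math.exp(math.log(bp_lo) + frac * (math.log(bp_hi) - math.log(bp_lo)))


# Invented flanking ladder bands: larger fragments migrate a shorter distance.
d_lo, bp_lo = 20.0, 10_000   # migration distance (mm), size (bp)
d_hi, bp_hi = 15.0, 15_000
sample_distance = 17.5       # sample band halfway between the ladder bands

linear_bp = local_linear(sample_distance, d_lo, bp_lo, d_hi, bp_hi)
log_bp = local_log(sample_distance, d_lo, bp_lo, d_hi, bp_hi)
percent_difference = 100.0 * (linear_bp - log_bp) / log_bp
print(f"linear {linear_bp:.0f} bp, log {log_bp:.0f} bp, diff {percent_difference:.2f}%")
```

With these invented positions the linear estimate is 12 500 bp and the logarithmic estimate is about 12 247 bp, a difference of roughly 2%. The two methods agree exactly at the ladder bands themselves and diverge most midway between widely spaced ladder components, which is the situation for large fragments.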