Anal. Chem. 1997, 69, 1406-1413
NIST/NCI Micronutrients Measurement Quality Assurance Program: Measurement Repeatabilities and Reproducibilities for Fat-Soluble Vitamin-Related Compounds in Human Sera David L. Duewer,* Jeanice Brown Thomas, Margaret C. Kline, William A. MacCrehan, Robert Schaffer, Katherine E. Sharpless, and Willie E. May
Analytical Chemistry Division, Chemical Science and Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland 20899 James A. Crowell
Chemoprevention Branch, Division of Cancer Prevention and Control, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892
The NIST/NCI Micronutrient Measurement Quality Assurance Program has conducted 33 interlaboratory comparison exercises for fat-soluble vitamin-related compounds in human sera over the past 12 years. Periodic reanalysis of lyophilized serum samples prepared from more than 70 different sera has enabled estimation of the short- and long-term measurement characteristics. Median- and interquartile-range-based statistics adequately estimate the distribution of results from laboratories that are in analytical control from total distributions that include a significant minority of outlier data. Short-term interlaboratory reproducibility standard deviations (SDs) are predictable functions of analyte concentration, with an asymptotic limit at low analyte concentration and a linear relationship at high concentrations. Long-term trends in the interlaboratory reproducibility can be estimated by standardizing the short-term SD at the observed analyte concentration to an expected SD at a given physiologically significant analyte concentration. The “average” laboratory’s same-day analytical repeatability SD is about one-third of the estimated interlaboratory reproducibility; repeatability for longer periods between analyses is, on average, no better than the reproducibility. While a few exceptional laboratories have maintained excellent repeatability over the entire decade, long-term study measurements generated within a single laboratory are not generally more internally consistent than results from multiple laboratories. Enhanced and more consistently implemented intralaboratory quality control and quality assurance methods are required to further improve and maintain interlaboratory measurement comparability. In 1984, the National Institute of Standards and Technology (NIST) and the National Cancer Institute (NCI) initiated what is now known as the NIST/NCI Micronutrients Measurement Quality Assurance Program (QAP). 
This program was designed to support and enhance analytical measurements of micronutrients with potential cancer chemopreventive activity; retinol, α-tocopherol, and β-carotene isomers have been the analytes of greatest interest.1-5 The reference material and analytical method development components of the QAP are reported elsewhere,6-16 as are long-term repeatabilities of NIST measurements.17 We report here interlaboratory measurement characteristics for retinol (R), retinyl palmitate (RP), α-tocopherol (RT), γ-tocopherol (γT), total β-carotene (TβC), trans-β-carotene (tβC), total α-carotene (RC), β-cryptoxanthin (βCr), and total lycopene (Ly) over the past decade.
(1) Diplock, A. T., Machlin, L. J., Packer, L., Pryor, W. A., Eds. Vitamin E: Biochemistry and Health Implications; Annals of the New York Academy of Sciences 570; New York Academy of Sciences: New York, 1989.
(2) Bendich, A., Butterworth, C. E., Jr., Eds. Micronutrients in Health and in Disease Prevention; Marcel Dekker, Inc.: New York, 1991.
(3) Sauberlich, H. E., Machlin, L. J., Eds. Beyond Deficiency: New Views on the Function and Health Effects of Vitamins; Annals of the New York Academy of Sciences 669; New York Academy of Sciences: New York, 1992.
(4) Canfield, L. M., Krinsky, N. I., Olson, J. A., Eds. Carotenoids in Human Health; Annals of the New York Academy of Sciences 691; New York Academy of Sciences: New York, 1993.
(5) Krinsky, N. I., Sies, H., Eds. Antioxidant vitamins and β-carotene in disease prevention. Am. J. Clin. Nutr. 1995, 62 (supplement).
(6) MacCrehan, W. A.; Schönberger, E. Clin. Chem. 1987, 33, 1585-1592.
(7) MacCrehan, W. A.; Schönberger, E. J. Chromatogr. Biomed. Appl. 1987, 417, 65-78.
(8) MacCrehan, W. A. Methods Enzymol. 1990, 189, 172-181.
(9) Craft, N. E.; Sander, L. C.; Pierson, H. F. J. Micronutr. Anal. 1990, 8, 209-221.
(10) Craft, N. E.; Wise, S. A.; Soares, J. H., Jr. J. Chromatogr. 1992, 589, 171-176.
(11) Epler [Sharpless], K. S.; Sander, L. C.; Ziegler, R. G.; Wise, S. A.; Craft, N. E. J. Chromatogr. 1992, 595, 89-101.
(12) Epler [Sharpless], K. S.; Ziegler, R. G.; Craft, N. E. J. Chromatogr. Biomed. Appl. 1993, 619, 37-48.
(13) Sander, L. C.; Epler Sharpless, K. S.; Craft, N. E.; Wise, S. A. Anal. Chem. 1994, 66, 1667-1674.
(14) National Institute of Standards and Technology. In Methods for Analysis of Cancer Chemopreventive Agents in Human Serum; Brown Thomas, J., Sharpless, K. E., Eds.; NIST Special Publication 874; U.S. Government Printing Office: Washington, DC, 1995.
(15) Brown Thomas, J.; Kline, M. C.; Schiller, S. B.; Ellerbe, P. M.; Sniegoski, L. T.; Duewer, D. L.; Sharpless, K. E. Fresenius J. Anal. Chem. 1996, 356, 1-9.
(16) Sharpless, K. E.; Brown Thomas, J.; Sander, L. C.; Wise, S. A. J. Chromatogr. Biomed. Appl. 1996, 678, 187-195.
(17) Sharpless, K. S.; Duewer, D. L. Anal. Chem. 1995, 67, 4416-4422.
Analytical Chemistry, Vol. 69, No. 7, April 1, 1997
S0003-2700(96)00772-X This article not subject to U.S. Copyright. Publ. 1997 Am. Chem. Soc.
Knowledge of the historical measurement performance for these
analytes may enhance interpretation of multilaboratory clinical and epidemiological studies involving measurements of these and related fat-soluble analytes. There are well-established procedures for characterizing standard analytical measurement performance through interlaboratory studies, even where each laboratory is free to use its own analytical methods.18-21 There are no established methods for documenting changes in interlaboratory measurement capabilities over time for studies where neither the samples nor the laboratory/analyst population are held constant. Because participation in the QAP is open and voluntary, the number of laboratories reporting results has varied considerably from exercise to exercise, few laboratories have participated in all the QAP interlaboratory exercises, and the number of and prior analytical experience of the reporting analysts is not documented. This paper describes the quantitative evaluation of many longterm interlaboratory measurement characteristics, including the following: (1) the number and nature of the QAP participants and of the QAP samples, (2) the data analysis methods used to characterize results of each QAP exercise, (3) interlaboratory measurement reproducibility as a function of analyte concentration and time, and (4) the distribution of intralaboratory measurement repeatabilities relative to interlaboratory measurement reproducibility. METHODS AND MATERIALS Interlaboratory Comparison Quality Assurance Program. The first NIST/NCI Micronutrients Measurement QAP interlaboratory comparison exercise, termed Round Robin I (RR I), was conducted in Fall 1984 and involved 20 laboratories including NIST. In Spring 1996 (exercise RR XXXVI), 51 laboratories (of the 62 receiving samples) returned results. There have been a total of 33 QAP exercises for fat-soluble vitamin-related compounds (three early exercises were entirely devoted to trace metal and/ or ascorbic acid measurements). 
Participants have received tabular and graphical analyses of their results at the completion of each exercise. Workshops have been held annually to discuss past results, share advanced measurement methodologies, and prioritize future efforts.
Analytes. Whereas three analytes were reported in the first 3 years of the QAP (R, RT, TβC), more than 18 analytes were reported in RR XXXVI. Figure 1a presents the number of laboratories reporting various analytes as a function of time; all analytes reported by more than 10 laboratories in RR XXXVI are depicted. Three groups of analytes can be identified: most laboratories report R, RT, and TβC; about 50% report RC, Ly, γT, and/or βCr; and 25% report RP and/or tβC.
Laboratories. Of the 100 laboratories that have participated in at least one QAP exercise, eight of the 24 “original” laboratories are still actively involved, and three laboratories have participated in all 33 exercises. Figure 1b presents the total number of laboratories providing results over time, along with the number of original participants remaining and the number of “new” (no data for any of the preceding three exercises) participants. Laboratories with a number of different primary missions participate in the QAP, including fundamental research, clinical application, product quality control, and demonstration of measurement capability. Approximately 40% of current participants are associated with universities, 40% are independent clinical support laboratories, 15% are consumer product manufacturers, and 5% are government. Approximately 60% are located in North America (representing one province, 21 states, and the District of Columbia), 25% are in Europe (11 countries), and the remainder are in Bangladesh, Ghana, New Zealand, the People’s Republic of China, Peru, South Africa, and Taiwan.
Figure 1. Analyte and laboratory participation history. These curves display the number of QAP participants over time. The upper section (a) displays the number of laboratories reporting a given analyte for all analytes reported by at least 10 participants in RR XXXVI. The lower section (b) displays the total number of participants, the number of new participants, and the number remaining of the original group that participated in the earliest exercises.
Methods. Since there are no true “standard” methods for any of the fat-soluble vitamin-related analytes in serum, each QAP laboratory has its own analytical method. Most current participants use rather similar extraction procedures and reversed-phase liquid chromatography (LC) methods. This consensus has partly resulted from the three essential elements of the QAP program: method development, evaluation of exercise results, and technology transfer during workshops. Briefly, the analytes of interest are extracted from the serum into a nonpolar volatile organic solvent, transferred to a solvent miscible with the LC mobile phase composition, separated with LC using absorbance and/or fluorescence detection, and quantified using a variety of internal standards.6
(18) International Organization for Standardization. Precision of Test Methods - Determination of Repeatability and Reproducibility for a Standard Test by Interlaboratory Tests; ISO 5725; ISO: Geneva, Switzerland, 1986.
(19) American Society for Testing and Materials. Standard Practice for Conducting an Interlaboratory Study to Determine the Precision of a Test Method; ASTM E 691-87; ASTM: Philadelphia, PA, 1987.
(20) Association of Official Analytical Chemists. Statistical Manual of the AOAC; AOAC International: Arlington, VA, 1987.
(21) Mesley, R. J.; Pocklington, W. D.; Walker, R. F. Analyst 1991, 116, 975-990.
Samples. The earliest interlaboratory exercises established that lyophilized serum was the most appropriate matrix for study by virtue of its ease of handling, analyte stability, and relevance to the measurement problem. Since Summer 1987 (RR XII), three exercises have been held per year, with distribution of three to five lyophilized samples of human serum per exercise.15 The long-term stability of fat-soluble vitamin-related compounds in this matrix when stored at -80 °C has been documented.22 Serum pools, yielding approximately 250 individual samples, have typically been blended to achieve various low-normal to high-normal concentrations of R, RT, and TβC. A few sera with high endogenous carotenoid levels have been obtained through nutritional supplementation. Samples were also augmented with exogenous R, RP, RT, and/or γT. As of RR XXXVI, 144 blind samples derived from 71 different sera had been distributed. The samples distributed in any given QAP exercise are selected to span a range of analyte levels and/or to document some particular analytical capability.
Control and Reference Materials. Four sera were prepared in 1987 in 800-vial lots for use as control materials, as well as for distribution as blind samples. Used as control materials, these samples and their NIST-assigned levels of R, RT, and TβC were supplied to participants in several exercises held during 1988 and 1989. All participants were requested to use the controls to validate their analytical systems before analysis of blind samples. Nine sera have been produced in 1800-vial lots for use as NIST Standard Reference Materials (SRMs) for the vitamin-related analytes: SRM 968 in 1989, SRM 968a in 1992, and SRM 968b in 1995. All of these SRM sera have been distributed from one to three times in the QAP as blind samples.
Measures of Measurement Variation. Short- and Long-Term Repeatabilities.
Repeatability is formally defined as the closeness of agreement among same-method results obtained on independent test items in the same laboratory by the same operator using the same equipment within short intervals of time.23 For the purpose of characterizing the multiexercise QAP data, we define “repeatability” to represent replicate measurements reported by a given laboratory regardless of the number or identity of the analysts actually performing the measurements. This generalization potentially includes such sources of variation as multiple analysts, different materials, equipment changes, and protocol modifications. A repeatability standard deviation for a given laboratory is most directly estimated using measurements from multiple aliquots of multiple samples of a given material. It can be calculated using the general formula for pooling variance:
$$SD = \sqrt{ \sum_{j=1}^{n} (n_j - 1)\,SD_j^{2} \Bigg/ \left( \left( \sum_{j=1}^{n} n_j \right) - n \right) }$$
where n is the number of samples, nj is the number of independent analyses of a given sample, and SDj is the standard deviation over all analyses of the given sample. However, this does not necessarily capture the same components of measurement variation for all laboratories: some participants have reported only one average value for two unblinded vials while others have reported values for two injections of a single sample extract. We estimate repeatabilities using (1) a small number of replicate blind samples distributed in the same exercise and (2) a larger number of blind replicates distributed in two or more exercises. Same-exercise replicates enable estimation of “short-term” repeatability, the extent of agreement among a given laboratory’s measurements made over periods of hours to days. Likewise, different-exercise replicates enable estimation of “long-term” repeatability, the extent of agreement among a given laboratory’s measurements over periods of months to years.
Short- and Long-Term Reproducibilities. Reproducibility is formally defined as the closeness of agreement among same-method results obtained in different laboratories by different operators using different equipment.23 We equate “short-term reproducibility” with this formal definition applied to results from a single exercise, using one value for each sample reported by a given laboratory. “Long-term reproducibility” applies to similarly averaged results compared across two or more exercises.
(22) Brown Thomas, J. Stability of Retinol, α-Tocopherol, and β-Carotene in Lyophilized and Liquid-Frozen Sera; Report of Analysis 839.02-97-001; Analytical Chemistry Division, 222/B208, NIST: Gaithersburg, MD, Oct 22, 1996.
(23) International Organization for Standardization. Statistics - Vocabulary and Symbols; ISO 3534-1; ISO: Geneva, Switzerland, 1993.
RESULTS AND DISCUSSION
Estimating Short-Term Reproducibility. Measurement results for a given sample are typically summarized using the mean to estimate population location and the standard deviation (SD) to estimate population dispersion. Since mean and SD estimates are sensitive to the presence of very high and/or very low data values, identification and removal of “outlier” values are recognized as components of the data analysis.18-21 Most of the data analysis presented in past QAP summary reports was based on data censoring using “eyeball” evaluation of each laboratory’s submitted values by experienced NIST analysts.
As the number of participants increased, this highly subjective approach proved cumbersome, and the need for a more objective (and quicker) approach was recognized. Percentile-based statistics can provide objective and efficient estimates of population location and dispersion, even where a modest proportion of the data is of questionable accuracy. The median is a robust estimate of location as it is insensitive to a few outlier data, given that there are about equal numbers of high and low values. Examination of the distribution of analytical results for many sera, such as the histograms in Figure 2, revealed that most results cluster rather tightly together in a central “core group”, with a sizable minority of results scattered from quite low to quite high relative to this group. The interquartile range (IQR, the range encompassing the central 50% of the data) is likewise a robust estimate of dispersion. For a Gaussian population, the median and the mean are identical, and 0.741 times the IQR is equal to the SD.24 To facilitate comparisons, we define 0.741IQR as the “estimated standard deviation” (eSD). As shown in Figure 2, Gaussian curves parametrized with the median and eSD effectively summarize the core of the observed measurement distributions. The full measurement population median and eSD agreed very well with nearly all the censored data mean and SD values; for the few sera where the originally reported censored statistics did not agree with the proposed percentile-based analogs, the original data censoring was found to have been incomplete. (24) Stuart, A., Ord, J. K., Eds. Kendall’s Advanced Theory of Statistics, 5th ed.; Oxford University Press: New York, 1987; Vol. 1, Chapter 10, Section 11.
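The percentile-based summary described above can be illustrated with a minimal Python sketch. The function name `median_and_esd` and the sample values are hypothetical and chosen only for illustration; the quartile interpolation rule used here (the standard library's "inclusive" method) is also an assumption, since the text does not state how the quartiles were computed.

```python
import statistics

def median_and_esd(values):
    """Return (median, eSD), where eSD = 0.741 * interquartile range."""
    # quantiles() sorts the data and returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), 0.741 * (q3 - q1)

# Hypothetical results (µg/mL): a tight core group plus two stray values.
reported = [0.46, 0.48, 0.49, 0.50, 0.50, 0.51, 0.52, 0.54, 0.10, 1.20]

median, esd = median_and_esd(reported)
mean, sd = statistics.mean(reported), statistics.stdev(reported)

print(f"median = {median:.3f}, eSD = {esd:.3f}")
print(f"mean   = {mean:.3f}, SD  = {sd:.3f}")
```

With data of this shape, the two stray values inflate the conventional SD considerably, while the median and eSD continue to describe the core group.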
Figure 2. Relative frequency of occurrence of results for a representative serum. These curves display representative frequency histograms and two Gaussian summary distributions for total β-carotene results for two samples analyzed in the same exercise (RR XXIX). The dotted lines denote distributions using data mean and SD; the dark lines denote distributions calculated from the data median and eSD. Each histogram’s bin width is equal to 0.5eSD. All curves have been jointly scaled so that the most populated histogram bin has unit height.
Short-Term Reproducibility as a Function of Analyte Concentration. We have evaluated the interlaboratory reproducibility eSD as a function of analyte concentration. Figure 3 displays the relationships between the observed median and eSD for R, TβC, RT, γT, RC, Ly, βCr, and RP. The data are broken into three approximately equal time periods: 1985-1988, 1989-1992, and 1993-1995. While some of the earliest data are discordant, the relationships between median and eSD for each analyte are very similar among all three periods. Log10(eSD) appears to be a linear function of log10(median) over much of the concentration range of all analytes except RP. For all analytes except Ly, eSD approaches a constant limiting value at low concentration. This interlaboratory low-concentration asymptote marks the analyte level at which intermethod biases become dominant. We term this the “limit of quantitative comparison” (Lqc). The lack of an apparent Lqc limit for Ly indicates a need for more samples with low Ly concentration. The virtually constant eSD for all levels of RP suggests remarkable disparity among methods for this analyte. Samples with low Ly and/or high RP levels are needed in order to fully characterize the reproducibility characteristics for these analytes. The lines in Figure 3 represent models for each analyte that combine the constant eSD at and below Lqc and the logarithmically linear relationship at higher concentrations:
$$\mathrm{eSD} = \sqrt{L_{qc}^{2} + \left(\beta_0\,[\mathrm{analyte}]^{\beta_1}\right)^{2}}$$
where [analyte] represents the concentration of the analyte in question, and β0 and β1 are regression parameters for the log-linear component of the curve. Table 1 lists the analyte-specific quantitative values for Lqc, β0, and β1 derived using just the most recent third of the data. Note that β1 is not unity for many analytes; even above the Lqc limit, interlaboratory measurement coefficients of variation (CV) are not constant but remain a function of analyte level. However, we can define a level-dependent “estimated CV” (eCV) for any given analyte level:
$$\mathrm{eCV} = 100 \times \frac{\mathrm{eSD}}{[\mathrm{analyte}]} = 100 \times \frac{\sqrt{L_{qc}^{2} + \left(\beta_0\,[\mathrm{analyte}]^{\beta_1}\right)^{2}}}{[\mathrm{analyte}]}$$
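As a minimal sketch of how the two expressions above behave, the following Python fragment evaluates the model eSD, sqrt(Lqc^2 + (β0[analyte]^β1)^2), and the corresponding eCV, 100 × eSD/[analyte]. The function names and the example concentration of 0.5 µg/mL are illustrative assumptions; the retinol parameter values are those listed in Table 1.

```python
import math

def esd_model(conc, lqc, beta0, beta1):
    """Model eSD: constant Lqc term combined in quadrature with beta0 * conc**beta1."""
    return math.hypot(lqc, beta0 * conc ** beta1)

def ecv_model(conc, lqc, beta0, beta1):
    """Level-dependent estimated CV, in percent."""
    return 100.0 * esd_model(conc, lqc, beta0, beta1) / conc

# Retinol parameters as listed in Table 1 (concentrations in µg/mL).
RETINOL = {"lqc": 0.020, "beta0": 0.088, "beta1": 1.17}

# Illustrative concentration only; not a value taken from the paper.
conc = 0.5
print(f"eSD = {esd_model(conc, **RETINOL):.4f} µg/mL")
print(f"eCV = {ecv_model(conc, **RETINOL):.2f} %")
```

At very low concentrations the power-law term vanishes and the model eSD tends to Lqc, so the eCV grows without bound as the concentration falls; this is the asymptotic behavior described in the text.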
The utility of the regression models to reproduce the data used to “train” the model is indicated in Table 1 as one standard deviation of the regression residuals, with the additive ±log10(eSD) value transformed to the more easily interpreted multiplicative ×/÷ eSD equivalent. The utility of the models for predicting the earlier data is likewise indicated. For most analytes, the prediction and recognition residuals are of very similar magnitude. Figure 4 summarizes the expected interlaboratory eCV as a function of analyte concentration over the entire adult concentration range,17 with the U.S. population median and central 90% range also indicated.25 Only for R and RT is eCV approximately constant over the entire range of analytical interest. This constancy is a result of at least four factors: (1) the physiological range for these actively regulated vitamins is relatively narrow,17 (2) these ranges are well above the Lqc, (3) both analytes are relatively easily chromatographically resolved,6 and (4) both analytes are efficiently extracted and are reasonably stable.
Changes in Short-Term Reproducibility with Time. Although changes in measurement performance over time cannot be determined directly by experiment (no serum has been analyzed in more than four separate QAP exercises), changes in short-term reproducibility can be indirectly determined using the regression models for eCV as functions of analyte concentration. Since eCV is not constant, all individual sera summaries must first be standardized to represent some given concentration, [analyte]s:
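$$\mathrm{eCV}_s = 100 \times \frac{\sqrt{L_{qc}^{2} + \left(\beta_0\,[\mathrm{analyte}]_s^{\beta_1}\right)^{2}}}{[\mathrm{analyte}]_s} \times \frac{\mathrm{eSD}}{\sqrt{L_{qc}^{2} + \left(\beta_0 \times \mathrm{median}^{\beta_1}\right)^{2}}}$$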
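The standardization can be read as scaling the model eCV at [analyte]s by the ratio of the observed eSD to the eSD the model expects at the observed median. A minimal Python sketch under that reading follows; the function names and the example serum values (observed eSD, median, and standard concentration) are hypothetical, and only the retinol parameters are taken from Table 1.

```python
import math

def expected_esd(conc, lqc, beta0, beta1):
    """Model eSD at a given concentration."""
    return math.hypot(lqc, beta0 * conc ** beta1)

def standardized_ecv(esd_obs, median_obs, conc_std, lqc, beta0, beta1):
    """Rescale an observed eSD to the eCV (percent) expected at conc_std."""
    # Ratio of observed dispersion to the dispersion the model predicts
    # at the observed median concentration.
    ratio = esd_obs / expected_esd(median_obs, lqc, beta0, beta1)
    # Apply that ratio to the model eCV at the standard concentration.
    return 100.0 * ratio * expected_esd(conc_std, lqc, beta0, beta1) / conc_std

# Retinol parameters as listed in Table 1 (µg/mL).
RETINOL = {"lqc": 0.020, "beta0": 0.088, "beta1": 1.17}

# Hypothetical serum summary: eSD of 0.050 at a median of 0.40 µg/mL,
# standardized to a hypothetical reference level of 0.55 µg/mL.
ecv_s = standardized_ecv(0.050, 0.40, 0.55, **RETINOL)
print(f"standardized eCV = {ecv_s:.2f} %")
```

A serum whose observed eSD matches the model exactly returns the model eCV at the standard concentration, so values above or below that figure indicate exercises with poorer or better reproducibility than the model predicts.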
These individual serum eCVs values can be pooled into a single estimate for each exercise:
$$\mathrm{eCV}_{s,RR} = \sqrt{ \sum_{i=1}^{n_{RR}} \mathrm{eCV}_{s,i}^{2} \Big/ n_{RR} }$$
where nRR is the number of samples analyzed in the given round robin. Figure 5 displays the one-year-wide moving average of the eCVs,RR for RR VII-XXXVI, using an observed adult population median concentration as [analyte]s. The best reproducibilities for R, RT, and TβC were achieved during the 1988-1989 period, when control materials were supplied to participants at no charge. We speculate that many participants calibrated their systems to the control materials, providing exceptional reproducibility at the expense of independent estimation of analytical accuracy. Reproducibilities for R and (25) Unpublished NHANES III Reference Data for Total U.S. Population, Four Years and Older (Excluding Pregnant Women), NCHS, 1996. Presented at Experimental Biology ‘96, Washington, DC, April 17, 1996.
Figure 3. Dependence of dispersion on location: eSD as a function of concentration. These plots display the dependencies of eSD (0.741IQR) on the median analyte concentration estimate. The data have been divided into three approximately equal-sized groups of all data taken during 1986-1989 (“1”), 1990-1992 (“2”), and 1993-1995 (“3”). The lines represent least-squares regression to the model eSD = [Lqc^2 + (β0 × median^β1)^2]^(1/2), using just the most recent third of the data. Each axis of each scattergram spans the same 120-fold relative concentration range. The plot for β-carotene shows both total (heavy line) and trans (light line).
Table 1. Long-Term Reproducibility as a Function of Analyte Concentration (a)
Model: eSD = [Lqc^2 + (β0[analyte]^β1)^2]^(1/2)

                                                                           training (b)      evaluation (c)
analyte            code  Lqc            β0             β1            n    ×/÷SD (d)    n    ×/÷SD (d)
retinol            R     0.020 (0.003)  0.088 (0.005)  1.17 (0.07)   39   1.2          76   1.4
α-tocopherol       RT    0.41 (0.06)    0.03 (0.02)    1.3 (0.2)     39   1.3          76   1.5
γ-tocopherol       γT    0.10 (0.03)    0.07 (0.03)    1.0 (0.3)     39   1.4          36   1.7
total β-carotene   TβC   0.007 (0.003)  0.129 (0.008)  0.80 (0.07)   39   1.3          76   1.6
trans-β-carotene   tβC   0.006 (0.003)  0.09 (0.01)    1.0 (0.2)     38   1.8          32   3.2
retinyl palmitate  RP    0.024 (0.003)  0.2 (0.5)      1 (2)         34   1.6          15   3.5
total α-carotene   RC    0.003 (0.002)  0.24 (0.01)    0.96 (0.02)   39   1.4          20   1.9
β-cryptoxanthin    βCr   0.002 (0.003)  0.149 (0.006)  0.8 (0.1)     39   1.3          20   2.1
total lycopene     Ly    0              0.21 (0.02)    0.86 (0.05)   39   1.3          36   1.6

(a) Concentrations in µg/mL.
(b) Results for samples distributed in RR XXVII-XXXV were used to “train” the least-squares regression.
(c) Results for samples distributed in RR VIII-XXVI were used to evaluate the predictive utility of the regression models.
(d) The concentration-space “standard deviation” is an asymmetric factor, with ×/÷SD spanning the region from concentration/SD to concentration × SD.
Figure 4. Dependence of expected coefficient of variation on analyte concentration. Curves are displayed for eCV as calculated from all eSD regression models shown in Figure 3. Dashed lines span the entire range of analyte concentrations encountered during analysis of U.S. adult human sera,17 and the solid lines span the central 90% of these measurements; the median of these measurements is denoted “b”.
RT have gradually declined following their 1989 optimum. After a period of decline, TβC reproducibility started improving at about the same time that other carotenoids began to be reported by a significant proportion of participants (∼1991). The enhanced separation technology required for carotenoid speciation may contribute to this recent improvement. Short-term reproducibilities for RP, βCr, and Ly have steadily improved over time. Values for newly determined analytes are typically first reported to the QAP by a single laboratory, followed by a period where several laboratories report values that are often contradictory. Shortly after consensus values appear to be established, a relatively large number of laboratories start reporting values for these new analytes in subsequent exercises (see Figure 1a). We believe that this pattern of expanded analytical capability and measurement reproducibility improvement is directly linked to QAP participation.16
Figure 5. Changes in short-term reproducibility over time. One-year running averages for eCV, standardized to the value expected for the median analyte concentration encountered during analysis of U.S. adult human sera.17 The period of control material distribution and the approximate dates of SRM 968 availability are indicated above the time axis.
Short- and Long-Term Repeatability. Comparison of individual laboratory repeatabilities provides one measure of how well each participating laboratory sustains its ability to generate consistently accurate results. The five sera that have been distributed as same-exercise blind replicates provide a basis for estimating short-term repeatability. The 54 sera that have been distributed in two or more different exercises provide a basis for estimating long-term repeatability. Because analyte concentrations differ widely among different sera, repeatability SD for different sera must first be normalized to some nominally common basis. Normalization to the observed eSD of the serum expresses repeatability as a fraction of reproducibility:
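$$F_{jl} = \frac{SD_{jl}}{eSD_{j}} = \frac{\sqrt{ \sum_{i=1}^{n_{jl}} \left(x_{ijl} - \bar{x}_{jl}\right)^{2} \Big/ \left(n_{jl} - 1\right) }}{\sqrt{ \sum_{i=1}^{n_{jl}} eSD_{ij}^{2} \Big/ n_{jl} }}$$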
where SDjl is the repeatability SD for a given laboratory l for a given serum j, eSDj is the reproducibility for serum j, njl is the number of QAP exercises in which laboratory l reported results for serum j, xijl are laboratory l’s reported values in each exercise i, x̄jl is the mean of laboratory l’s values, and eSDij is the observed reproducibility SD for serum j in exercise i. For those sera distributed in three or more exercises, evaluation of the replicates as njl(njl - 1)/2 duplicates permits identification of the fractional repeatability with a defined time separation. Figure 6 presents the repeatability, Fjl, as a function of the time separation for three laboratories and for the interlaboratory median for R, RT, and TβC. All three of the laboratories (designated A, B, and C) have participated in more than half of the QAP exercises. Their results represent the very best, “good average”, and “poor average” repeatability performance, respectively. The repeatability characteristic of laboratory A is nearly as good as that of the interlaboratory median and may thus provide an estimate of the intrinsic sample homogeneity. While the pooled Fjl values are of similar magnitude for all three analytes for each laboratory, the magnitudes are markedly different for the different laboratories. Sadly, we do not have the relevant information for exploring the causes of these dramatic differences. The short-term fractional repeatability Fjl values shown in Figure 6 (the symbols at duration zero) are considerably smaller than are the long-term Fjl values (shown as running-average curves from duration 0.1 to 4 years). There is no clear trend in Fjl as a function of the definition of “long-term” for these three laboratories nor for any other laboratory, although there are patterns in the Fjl values for some laboratories that appear related to the actual dates of analysis.
These patterns doubtless reflect specific changes in equipment, methods, and/or personnel and occur despite all within-organization quality control/assurance efforts. To a good approximation, it appears that all Fjl values for analyses performed for different QAP exercises can be pooled into a single “long-term” fractional repeatability regardless of the actual time separation. Figure 7 presents distributions over all individual laboratories of pooled short-term and long-term repeatabilities for R, RT, and TβC, where the long-term repeatability distribution is for all different-exercise Fjl values pooled into a single duration-independent summary. The median short-term repeatability is less than one-third of the interlaboratory reproducibility for all analytes; the median long-term repeatability is as large as or larger than the reproducibility. While a few specific laboratories provide remarkably stable results over very long time periods, repeatability for the “average” laboratory is no more stable over time than is the interlaboratory reproducibility. The unimodal, approximately log-Gaussian distributions of the short- and long-term fractional repeatabilities suggest that a laboratory’s state of analytical control varies continuously from “best possible” to “random”, not dichotomously from “in-control” to “out-of-control”. For analyte concentrations well above the Lqc, the IQR-based eSD (the most consistent 50% of the data) estimate
Figure 6. Repeatability as a fraction of reproducibility for three QAP participants. The average repeatability/reproducibility for R, RT, and TβC are shown for three experienced laboratories (denoted A, B, and C) as a function of the time separation between analyses. Short-term fractional repeatabilities are given by the lower-case laboratory symbols at time zero. The dark solid line represents the running average of the long-term fractional repeatability of laboratory A, the laboratory with the best overall reproducibility of all QAP participants. The short-dashed line is the running average for laboratory B, which ranks about 25% in repeatability. The long-dashed line represents laboratory C, which ranks about 75%. The light solid line represents the fractional repeatability characteristic of the interlaboratory median.
of short-term interlaboratory reproducibility may well directly estimate “average laboratory” long-term intralaboratory repeatability. If true, improving measurement comparability among laboratories will require improving “control” within individual laboratories. Serum Stability and Homogeneity. Since intrinsic or storage-induced sample heterogeneity will masquerade as measurement variability, all sample populations must be intrinsically homogenous and unchanged by storage for the above observations to be valid. Serum homogeneity and stability is ensured through NIST measurements and studies, including multisample NIST analysis of each sample distributed, multiyear stability testing of all SRM materials,15 a long-term study of a single serum stored since 1985 at the same -80 °C conditions used for other QAP samples,22 and comparison of results for sera distributed in two or more QAP exercises. CONCLUSIONS Comparison of analytical results from multiple interlaboratory comparison exercises can help characterize aspects of measure-
knowledge of the relationship between analyte level and expected measurement variation, and reliable methods for summarizing the location and dispersion characteristics of a given set of measurements. The percentile-based median and eSD provide suitably robust estimates of analyte level and short-term measurement repeatability. For most of the fat-soluble analytes studied in the QAP, there is a linear relationship between concentration and eSD at high concentration that smoothly approaches an asymptotic limit at low analyte concentration. “Analytical control” is better regarded as a relative and not an absolute concept. While there are outstanding exceptions, longterm study measurements generated within a single laboratory are not necessarily more internally consistent than are results from multiple laboratories. Better and more consistently implemented intralaboratory standardization, quality control, and quality assurance methods are required to improve interlaboratory measurement comparability.
Figure 7. Short- and long-term fractional repeatability distributions of QAP participants. These curves display the distribution of pooled short- (light lines) and long-term (dark lines) repeatabilities for R, RT, and TβC. The distributions of the individual laboratory repeatabilities are shown as histograms and as log-Gaussian curves. All curves in each plot have been jointly scaled so that the most populated histogram bin has unit height.
ment performance that are often neglected, particularly long-term measurement repeatability. Valid comparisons can be made even when identical samples are not used in every exercise, given homogenous and stable samples that are of similar nature,
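The robust summaries discussed in this section can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the QAP's actual computation: the eSD is taken here as IQR/1.349 (the factor that makes the IQR consistent with a Gaussian standard deviation), the root-sum-of-squares form of `expected_sd` is only one hypothetical function with a low-concentration asymptote and a high-concentration linear limit, and all function names, parameters, and numbers are invented for the example.

```python
import math
import statistics


def robust_summary(values):
    """Return (median, eSD) for a set of interlaboratory results.

    Assumption: eSD = IQR / 1.349, which equals the standard deviation
    for Gaussian data but is insensitive to a minority of outliers.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, (q3 - q1) / 1.349


def expected_sd(conc, sd_floor, cv):
    """Hypothetical SD model: constant sd_floor at low concentration,
    approaching cv * conc (linear) at high concentration."""
    return math.hypot(sd_floor, cv * conc)


def fractional_repeatability(lab_sd, reproducibility_sd):
    """Within-laboratory repeatability as a fraction of the
    interlaboratory reproducibility."""
    return lab_sd / reproducibility_sd


# Invented results for one serum: seven labs in control, one outlier.
results = [0.48, 0.50, 0.51, 0.52, 0.53, 0.49, 0.50, 1.90]
median, esd = robust_summary(results)
classical_sd = statistics.stdev(results)
print(f"median={median:.3f} eSD={esd:.3f} classical SD={classical_sd:.3f}")
```

With these invented numbers the single outlier inflates the classical SD far above the eSD, which is the behavior that motivates percentile-based statistics when a significant minority of results is out of control.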
ACKNOWLEDGMENT
This work has been supported in part by the Division of Cancer Prevention and Control, National Cancer Institute, National Institutes of Health. We thank Emil Schönberger, Lane C. Sander, and Neal E. Craft for their analytical insights and analyses of samples; Robert C. Paule for his analyses of the early data; Dennis J. Reeder for his involvement in the design and implementation of the early exercises; and Charles Boone, Winfred Malone, and Herbert F. Pierson for their support of the Micronutrients Measurement Quality Assurance Program within NCI. We particularly wish to thank all analysts who have participated in the interlaboratory QAP exercises and workshops. Their participation, enthusiastic support, and critical feedback are crucial to the continued improvement of interlaboratory measurement comparability.
Received for review July 30, 1996. Accepted January 23, 1997.
AC9607727
Abstract published in Advance ACS Abstracts, March 1, 1997.