Anal. Chem. 1999, 71, 1870-1878

Micronutrients Measurement Quality Assurance Program: Helping Participants Use Interlaboratory Comparison Exercise Results To Improve Their Long-Term Measurement Performance

David L. Duewer,* Margaret C. Kline, Katherine E. Sharpless, Jeanice Brown Thomas, and Kenneth T. Gary
Chemical Science and Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland 20899

Anne L. Sowell
Nutritional Biochemistry Branch, Centers for Disease Control and Prevention, Atlanta, Georgia 30341

Over the past decade, the Micronutrients Measurement Quality Assurance Program (M2QAP) at the National Institute of Standards and Technology (NIST) has administered nearly 40 interlaboratory comparison exercises devoted to fat-soluble vitamin-related analytes in human serum. While M2QAP studies have been used to help certify reference materials and to document the performance of analytical systems, the primary focus of the M2QAP has been, and remains, the improvement of among-participant measurement comparability for target analytes. Recent analysis of historical measurement performance indicated the most efficient mechanism for further improving measurement comparability among participants is the improvement of long-term (months to years) comparability within each laboratory. The summary reports for the M2QAP studies are being redesigned to provide more chemist-friendly analyses of participant performance, dissecting systematic and random components of measurement incomparability as functions of analyte level and time. This report documents the semantic and graphical tools developed to help interlaboratory-comparison-exercise participants interpret their own measurement performance.

The National Cancer Institute (NCI) and what is now the National Institute of Standards and Technology (NIST) established the Micronutrients Measurement Quality Assurance Program (M2QAP) in 1985. From 1986 through 1997, the M2QAP conducted 41 interlaboratory comparison exercises; in 37 of these, 20-60 participants assayed one or more fat-soluble vitamin-related analytes in 2-6 samples of lyophilized human serum. Details of the number and nature of the analytes reported, the sample materials used, and the laboratories participating in this program have been presented previously.1

* Corresponding author: (tel.) (301) 975-3935, (fax) (301) 977-0587, (e-mail) [email protected].

1870 Analytical Chemistry, Vol. 71, No. 9, May 1, 1999

The M2QAP studies have had the multiple immediate goals typical of interlaboratory comparison exercises: assessing the performance characteristics of methods of analysis (method performance), assessing the performance of participating laboratories (laboratory performance), and assigning most probable concentration values of an analyte in a material (material certification).2 Again, as typical of all interlaboratory studies, the primary motivation for the M2QAP studies has been to help “participants gain insight into their own performance and into the comparability of analytical results.”3 It was thus with some surprise and concern that our recent retrospective characterization of community-wide measurement performance concluded that “Better and more consistently implemented intralaboratory standardization, quality control, and quality assurance methods are required to improve interlaboratory measurement comparability.”1

The M2QAP community is quite diverse, spanning academic, commercial, and regulatory groups; clinical, nutritional, and epidemiological concerns; and U.S. and international interests. Given that participation is voluntary and that the different community members have quite different measurement needs and resources, external imposition of specific quality assurance methods is both imprudent and technically impractical. Continued participation, itself, in the voluntary program, suggests that the involved analysts (and/or their supervisors) wish to make quality measurements and are willing to modify their methods as problems are identified. However, we believe that many laboratory analysts lack the expertise to convert patterns in numbers into testable chemical hypotheses.

There is considerable literature, including international standards, on experimental design, outlier identification, statistical analysis, and nomenclature aspects for all types of interlaboratory comparison exercises.3-8 Few of these guides specifically address how to help analysts use interlaboratory results to improve their measurement systems. Statistical dissection of systematic and random sources of measurement error can provide much useful information, but these numbers do not yield insight if they are not successfully communicated. We have developed prototypical graphical reporting techniques intended to provide analysts with more chemically meaningful information. We have borrowed freely from Tufte’s graphical design concepts9 and from Youden plots,10 box plots, and other exploratory data analysis techniques that use the “native language” of the measurement to aid intellectual digestion of large amounts of data.11,12

We here present the analysis tools developed to help individual analysts interpret their own results for a given interlaboratory study, including nomenclature and definitions, detailed rules for constructing several graphical displays, and interpretation of selected examples. These tools may be useful to other programs that involve (1) multiple interlaboratory studies, particularly if conducted on a regular schedule, (2) multiple samples of qualitatively similar matrix and analyte composition but that have different analyte levels, and (3) a fairly large, slowly evolving participant community.

(1) Duewer, D. L.; Brown Thomas, J.; Kline, M. C.; MacCrehan, W. A.; Schaffer, R.; Sharpless, K. E.; May, W. E.; Crowell, J. A. Anal. Chem. 1997, 69, 1406-1413.
(2) Horwitz, W. Pure Appl. Chem. 1994, 66, 1903-1911.
(3) Günzler, H., Ed. Accreditation and Quality Assurance in Analytical Chemistry; Springer: Berlin, 1996.

10.1021/ac981074k Not subject to U.S. Copyright. Publ. 1999 Am. Chem. Soc. Published on Web 03/20/1999

NOMENCLATURE AND DEFINITIONS

A formal nomenclature has been developed for describing performance characteristics of a chemical measurement process for method-performance and material-certification evaluations.13 In these contexts, interlaboratory comparison exercises are used to provide “forest not the trees” statistical summaries to sponsors well-versed in chemical metrology. Neither this focus on composite results nor the formal terminology is well-suited to performance-improvement comparison exercises, where the primary audience is the individual analyst.
We use a looser and more natural language terminology for communicating “tree within the forest” information to practicing analytical chemists.

Comparability: “Accuracy” is an often-defined metrological concept having essence “closeness to (accepted) truth.”13,14 For the lyophilized serum samples used in the M2QAP, and probably for all natural-matrix samples and/or data from measurement systems that do not achieve complete separation and/or absolute detection selectivity, “truth” is an operationally defined concept. Most trace organic “analytes” in human serum are composites of multiple chemical entities; different measurement systems may composite these entities in different ways. Further, the periodic presence of (unintentional) interferents in interlaboratory samples makes assigning even “accepted values” to analytes in routine natural-matrix samples philosophically dubious, technically difficult, and economically prohibitive. It is also unnecessary when the goal is measurement comparability, not accuracy.

“Comparability” is an often-used common-sense concept that is seldom formally defined.15 For the M2QAP, we define comparability for a given measurement as its difference from the consensus value

$$\Delta_{ijl} = x_{ijl} - \bar{X}_j \qquad (1)$$

where $x_{ijl}$ is measurement $i$ of sample $j$ reported by participant $l$ and $\bar{X}_j$ is the interlaboratory consensus value for sample $j$. The M2QAP defines $\bar{X}_j$ for each analyte as the median of all non-NIST values (see note at end of this section). In an interlaboratory performance-improvement context, this focus on the consensus result without regard to the “truth” of that consensus is philosophically and pragmatically satisfying. Programmatic resources can be devoted to helping all participants make more similar measurements while minimizing much of the emotional reaction to (and economic punishment of) honest differences among analysts.

(4) International Organization for Standardization. Guide 35: Certification of Reference Materials: General and Statistical Principles; Geneva, Switzerland, 1989.
(5) International Organization for Standardization/International Electrotechnical Commission. Guide 43-1: Proficiency Testing by Interlaboratory Comparisons. Part 1. Development and Operation of Proficiency Testing Schemes; Geneva, Switzerland, 1996.
(6) Horwitz, W. Pure Appl. Chem. 1995, 67, 331-343.
(7) Thompson, M.; Wood, R. Pure Appl. Chem. 1993, 65, 2123-2144.
(8) Egan, H., West, T. S., Eds. Collaborative Interlaboratory Studies in Chemical Analysis; Pergamon: Oxford, 1981.
(9) Tufte, E. R. The Visual Display of Quantitative Information; Graphics Press: Cheshire, CT, 1983.
(10) Youden, W. J. Industrial Quality Control 1959, XV, 1-5.
(11) Tukey, J. W. Exploratory Data Analysis; Addison-Wesley: Reading, MA, 1977.
(12) Spring, C. B.; Pielert, J. H.; Leigh, S.; Heckert, N. A. Graphical analysis of the CCRL portland cement proficiency sample database (samples 1-72); NISTIR 5387; U.S. Government Printing Office: Washington, DC, 1994.
(13) Currie, L. A. Pure Appl. Chem. 1995, 67, 1699-1723.
(14) ASTM, ASTM Committee on Terminology. Compilation of ASTM Standard Definitions; Philadelphia, PA, 1994.

Concordance: Measurement accuracy implies knowledge and control of systematic and random sources of error. Closeness of the average value of a series of measurements to truth is termed “trueness”5 or “lack of bias.”13 By analogy, measurement comparability can likewise be decomposed into systematic and random components; the M2QAP term for systematic “closeness to consensus” is “concordance.” While differently defined in statistical-ranking theory,16 concordance has an appropriate natural-language interpretation with a mildly positive connotation. “Discordance” is a simple opposite, and concordance and comparability have the same linguistic direction (better concordance, better comparability). In contrast, bias has a strongly negative connotation; bias and accuracy have opposite directions, and unwieldy (“less biased”) or untrue (“unbiased”) opposites. Trueness has the proper direction and an emphatically positive connotation, but unwieldy (“less trueness”), untrue (“falseness”), or unconnected (“bias”) opposites. While perhaps trivial, these natural-language semantic deficiencies hinder data-analyst-to-laboratory-analyst discourse. While comparability is defined for a single measurement, concordance is a statistical property of multiple measurements.
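As an illustration of the comparability definition (eq 1), the following minimal Python sketch computes a consensus value as the median of the non-NIST results for one sample and then the difference of one result from that consensus. The lab codes, the `excluded_labs` labels, and the numerical values are hypothetical and serve only to make the example runnable; they are not M2QAP data.

```python
from statistics import median

def consensus_value(results, excluded_labs=("NIST-A", "NIST-B")):
    """Median of the reported values for one sample, skipping excluded labs.

    The excluded lab labels are made up for this sketch; the text only says
    that NIST values are left out of the consensus.
    """
    values = [value for lab, value in results.items() if lab not in excluded_labs]
    return median(values)

def comparability(measurement, consensus):
    """Eq 1: difference between one measurement and the consensus value."""
    return measurement - consensus

# Hypothetical results for a single sample, keyed by invented lab codes.
sample_results = {"lab-01": 0.52, "lab-02": 0.48, "lab-03": 0.50,
                  "lab-04": 0.61, "NIST-A": 0.49}
x_bar = consensus_value(sample_results)                  # median of four values
delta = comparability(sample_results["lab-04"], x_bar)   # positive: above consensus
```

A positive difference means the result lies above the consensus, a negative one below; nothing in the calculation requires knowing a "true" value.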
The detailed interpretation of concordance depends on what is meant by “multiple measurements.” Given replicate measurements on a single sample, the concordance for that sample for a given participant is the average comparability

$$C_{jl} = \Bigl(\sum_{i=1}^{N_{jl}} \Delta_{ijl}\Bigr) / N_{jl} \qquad (2)$$

where $N_{jl}$ is the number of replicate measurements on sample $j$ made by participant $l$. Since M2QAP participants provide just one result per analyte per sample, sample concordance can be differentiated from measurement comparability only for those sample materials that have been distributed more than once. However, a participant concordance can be estimated by standardizing comparabilities for a set of samples

$$C_l = \Bigl(\sum_{j=1}^{N_l} \frac{C_{jl}}{S_j}\Bigr) / N_l \qquad (3)$$

where $N_l$ is the number of samples quantitatively analyzed by participant $l$, and $S_j$ is the among-participant standard deviation observed or expected for each sample. This transforms each individual sample’s distribution of comparability values to the same nominal distribution centered on zero with unit variance (“z-scoring”).17 It also converts the units of comparability from “concentration difference” to “number of $S_j$” from the consensus value.

The sine qua non condition for valid chemical data and statistical interpretation is that all measurement processes are in “statistical control.”13 However, at least for the measurement systems employed by M2QAP participants, control is not a binary “in” or “out” dichotomy but varies continuously from “!” through “??” to “@#$%!”1 The conceptual model for all M2QAP analytes is that there are two data distributions: a majority of values roughly normal about the consensus value and a minority roughly uniform from zero to well above the maximum literature value for the analyte in human serum. We therefore use robust nonparametric statistics to estimate what $S_j$ “should be” if all the results for sample $j$ were from a single normal distribution as1,18

$$\hat{S}_j = 0.741 \times \mathrm{IQR}_j \qquad (4)$$

where $\mathrm{IQR}_j$ is the observed Inter-Quartile Range (the range spanned by the central 50% of the non-NIST values). The $\hat{S}_j$ values for all the routinely reported fat-soluble vitamin-related analytes follow a generalization of the usual functional dependence of among-participant standard deviation on $\bar{X}_j$19

$$S(\bar{X}_j) = \sqrt{L_{qc}^2 + (\beta_0 \times \bar{X}_j^{\beta_1})^2} \qquad (5)$$

where $L_{qc}$ (the “lower limit of quantitative comparison”) and $\beta_0$ and $\beta_1$ are analyte-specific empirical constants.1 To ensure that the participant concordance for a particular set of samples is not dominated by an atypically small $\hat{S}_j$, the M2QAP uses the maximum of the estimated and calculated $S_j$ values in the concordance calculation

$$S_j = \mathrm{Max}(\hat{S}_j, S(\bar{X}_j)) \qquad (6)$$

To ensure that a single “wild” value does not dominate the calculated concordance, all standardized comparabilities are bounded by $\pm Q_\infty$:

$$\Bigl(\frac{C_{jl}}{S_j}\Bigr)^{*} = \mathrm{Max}\Bigl(-Q_\infty, \mathrm{Min}\Bigl(Q_\infty, \frac{C_{jl}}{S_j}\Bigr)\Bigr) \qquad (7)$$

Since there is only about a 1:1000 chance of a “good” measurement being by chance outside 3 $S_j$ from the center of a normal distribution, we typically assign $Q_\infty$ a value of 3.

(15) ASTM, ASTM Committee D-22. D 4430-96: Standard Practice for Determining the Operational Comparability of Meteorological Measurements. Annual Book of ASTM Standards, Vol. 11.03; West Conshohocken, PA, 1997; pp 277-279.
(16) Marriott, F. H. C. A Dictionary of Statistical Terms, 5th ed.; John Wiley: New York, 1990.
(17) ASTM Committee E-36. E 1301-95: Standard Guide for Proficiency Testing by Interlaboratory Comparisons. Annual Book of ASTM Standards, Vol. 14.02; ASTM: West Conshohocken, PA, 1997; pp 809-822.
(18) Stuart, A., Ord, J. K., Eds. Kendall’s Advanced Theory of Statistics, Volume 1, 5th ed.; Oxford University Press: New York, 1987; Chapter 10, Section 11.
(19) ASTM Committee E-1. E 1763-95: Standard Guide for Interpretation and Use of Results from Interlaboratory Testing of Chemical Analysis Methods. Annual Book of ASTM Standards, Vol. 3.06; ASTM: West Conshohocken, PA, 1997; pp 582-593.

Precision: Closeness of agreement among series of replicate measurements is termed “precision”13 and implies control of random sources of measurement error. Operationally, precision is estimated as the standard deviation of a set of measurements after adjustment for systematic differences. Precision is thus as appropriate for defining comparability as it is for defining accuracy; it has the appropriate natural-language interpretation and positive connotation; “imprecision” is a simple opposite, and precision, accuracy, and comparability have the same linguistic directions. As with concordance, precision is a statistical property of multiple measurements whose interpretation depends on the experimental design. Single-sample precision for a given participant is the concordance-adjusted standard deviation over replicate measurements:

$$P_{jl} = \sqrt{\Bigl(\sum_{i=1}^{N_{jl}} (\Delta_{ijl} - C_{jl})^2\Bigr) / (N_{jl} - 1)} \qquad (8)$$
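The chain from the robust scale estimate to the bounded participant concordance and the single-sample precision (eqs 2 through 8) can be sketched in Python as follows. This is a minimal sketch, not the program's actual software: the function names are invented, the quartile convention (`method="inclusive"`) is an assumption because the text does not state one, and any values passed for the analyte-specific constants of eq 5 (`l_qc`, `beta0`, `beta1`) would be hypothetical.

```python
import math
from statistics import quantiles

Q_INF = 3.0  # bound on standardized values; the text typically uses 3

def robust_sd(values):
    """Eq 4: 0.741 x interquartile range of the non-NIST values.

    The quartile interpolation rule is an assumption of this sketch.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return 0.741 * (q3 - q1)

def expected_sd(consensus, l_qc, beta0, beta1):
    """Eq 5: sqrt(L_qc^2 + (beta0 * consensus^beta1)^2)."""
    return math.hypot(l_qc, beta0 * consensus ** beta1)

def sample_scale(values, consensus, l_qc, beta0, beta1):
    """Eq 6: the larger of the observed and the expected standard deviation."""
    return max(robust_sd(values), expected_sd(consensus, l_qc, beta0, beta1))

def bounded(z, q=Q_INF):
    """Eq 7: clamp a standardized value to the interval [-q, +q]."""
    return max(-q, min(q, z))

def participant_concordance(sample_concordances, scales):
    """Eq 3, with each C_jl / S_j term bounded as in eq 7."""
    terms = [bounded(c / s) for c, s in zip(sample_concordances, scales)]
    return sum(terms) / len(terms)

def single_sample_precision(deltas):
    """Eq 8: standard deviation of replicate comparabilities about C_jl."""
    c_jl = sum(deltas) / len(deltas)  # eq 2
    squares = sum((d - c_jl) ** 2 for d in deltas)
    return math.sqrt(squares / (len(deltas) - 1))
```

With one wildly discordant sample among three, for example `participant_concordance([1.0, -2.0, 10.0], [1.0, 1.0, 1.0])`, the third term is clamped to 3 before averaging, so a single outlier cannot dominate the result.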

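Equation 8 can be generalized to multiple samples by standardizing each single-sample precision to $S_j$ and pooling the results:

$$P_l = \sqrt{\Bigl(\sum_{j=1}^{N_l} \sum_{i=1}^{N_{jl}} \Bigl(\frac{\Delta_{ijl} - C_{jl}}{S_j}\Bigr)^2\Bigr) / \Bigl(\sum_{j=1}^{N_l} (N_{jl} - 1)\Bigr)} \qquad (9)$$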

The interpretation of $P_l$ strongly depends on the nature of the replication. If only “blind replicate” samples (two or more samples of the same serum distributed as different unknowns in the same study) are combined, then $P_l$ is a short-term ($P_l^{\mathrm{short}}$) precision estimate.1 If only replicates from different studies are combined, then $P_l$ represents long-term ($P_l^{\mathrm{long}}$) precision.1 Since short-term measurement stability is a necessary but not sufficient condition for long-term stability, $P_l^{\mathrm{long}}$ should always be as large or larger than $P_l^{\mathrm{short}}$.

Apparent Precision: The above precision estimates can be used only with materials that have been distributed more than once. While 26 of the 37 M2QAP interlaboratory studies have included at least one previously analyzed material, only 16 have included two or more such materials. Even for interlaboratory studies using sufficient “old” samples for reliable among-sample precision estimation, estimates can be made only for those laboratories that provided data in the requisite previous studies.

An “apparent precision” can be estimated for all participants in every study by referencing values to the among-sample concordance as well as standardizing to $S_j$:

$$AP_l = \sqrt{\Bigl(\sum_{j=1}^{N_l} \Bigl(\frac{\Delta_{jl}}{S_j} - C_l\Bigr)^2\Bigr) / (N_l - 1)} \qquad (10)$$
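The pooled precision (eq 9) and the apparent precision (eq 10) can be sketched side by side in Python. This is a hedged illustration only: the function names, the data layout (one list of replicate comparabilities per replicated sample), and the optional `cap` argument are assumptions of the sketch. Both functions use unbounded concordances internally; the cap is applied only to the final value, for display on a common scale.

```python
import math

def pooled_precision(replicate_deltas, scales, cap=None):
    """Eq 9: pooled, standardized within-sample scatter for one participant.

    replicate_deltas[j] holds the participant's comparabilities for the
    replicate measurements of sample j; scales[j] is S_j for that sample.
    """
    squares, dof = 0.0, 0
    for deltas, s_j in zip(replicate_deltas, scales):
        c_jl = sum(deltas) / len(deltas)  # eq 2, left unbounded
        squares += sum(((d - c_jl) / s_j) ** 2 for d in deltas)
        dof += len(deltas) - 1
    p_l = math.sqrt(squares / dof)
    return p_l if cap is None else min(p_l, cap)

def apparent_precision(deltas, scales, cap=None):
    """Eq 10: scatter of standardized comparabilities about their mean.

    deltas[j] is the single comparability reported for sample j.
    """
    z = [d / s for d, s in zip(deltas, scales)]
    c_l = sum(z) / len(z)  # eq 3 without the eq 7 bounding
    ap_l = math.sqrt(sum((v - c_l) ** 2 for v in z) / (len(z) - 1))
    return ap_l if cap is None else min(ap_l, cap)
```

Which samples are passed to `pooled_precision` decides whether the result is a short-term or a long-term estimate: blind replicates from one study give the former, repeat distributions across studies the latter.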

While having the same natural-language interpretation and direction as precision, $AP_l$ is a composite of “true” short-term measurement precision characteristics for individual samples and of sample-specific discordances such as may arise when different measurement systems that have different selectivity and/or sensitivity characteristics are challenged with atypical samples. Since $AP_l$ includes all of the components of variance estimated by $P_l^{\mathrm{short}}$, $AP_l$ should be as large or larger than $P_l^{\mathrm{short}}$. However, since long-term precision can be degraded by sample-independent changes in a measurement system, there is no necessary relationship between $AP_l$ and $P_l^{\mathrm{long}}$.

To enable comparison of long-term precision estimates across different experimental designs, $P_l^{\mathrm{long}}$ is calculated using the average value for any blind replicates. While important only when measurements are unusually discordant, both $P_l$ and $AP_l$ are most appropriately calculated with “unbounded” concordances. To ensure that all comparability terms can be displayed on the same scale, it is convenient to bound $P_l$ and $AP_l$ to have a maximum value of $Q_\infty$.

Note: Several reviewers have questioned the exclusion of NIST results from the statistical calculations described above. The policy of separately reporting NIST results is intended to “add value” to the M2QAP by providing participants with completely independent reference values from a known source. Typically, two NIST analysts separately analyze multiple vials of each serum in duplicate. These analysts use different analytical protocols and equipment; they do not share quantitative information about the samples and try very hard not to have knowledge of any participant data (occasionally they have it thrust upon them when a participant requests analytical advice).
The average of each analyst’s values for every analyte that they measure is included in the “feedback report” for each exercise; these are the only results that are reported with Institutional attribution. The combined results from both analysts are used to estimate intrinsic material heterogeneity and various analysis-related components of variance for each serum. These statistics are reported in (perhaps excessive) detail in the feedback reports.

GRAPHICAL TOOLS

Summary Box Plots: The first concern for any participant in any interlaboratory study is “How well did I do?” When only natural-matrix samples are analyzed, this is identical to “How well did I do relative to the other participants?” Figure 1 is the “summary box plot” that addresses this question for one participant for one analyte in a recent M2QAP study. The dot and box for each sample display the participant’s data relative to the distribution of the central 50% of all participants’ data. Discordance manifests as values uniformly above or below the sample medians; imprecision manifests as some values above and some below the median. When the participant has reported data for previous studies, it is sometimes possible to differentiate “true” from “apparent” imprecision by the extent of self-agreement with the older data.

Figure 1. Summary Box Plot. The box for each sample displays the 75th, 50th (median), and 25th percentile of quantitative data reported for a given analyte for a given sample in a particular interlaboratory study. The top of the box denotes the 75th percentile, the bottom of the box denotes the 25th, and the solid line within the box denotes the 50th. The participant’s measurement is displayed as a solid circle. Relationships among the samples, such as short-term (sera 223 and 225) and long-term replication (sera 202, 210, and 227 and sera 223, 225, and 228) are indicated with dashed lines. Sample identification numbers and approximate study dates are normally displayed below each box. Box plots for all analytes in a given study have the same graphical structure; the details of that structure (such as the location of the concentration axis, the order of unrelated samples, and the location of the sample labels) differ according to the particular experimental design. This box plot displays retinol measurements for M2QAP XXXIX (and prior replicate samples) reported by the CDC.

To enable approximately balanced graphical display of the lowest to the highest analyte-level samples, analyte concentrations are graphed on a logarithmic vertical axis spanning the range

$$X_{max} = \max_{1 \le j \le N_S}\{\bar{X}_j + 3 \times S(\bar{X}_j)\} \qquad (11a)$$

$$X_{min} = \min_{1 \le j \le N_S}\{\bar{X}_j\} \times \max_{1 \le j \le N_S}\{\bar{X}_j\} \div X_{max} \qquad (11b)$$

where $N_S$ is the number of samples distributed to each participant in the particular interlaboratory study. This definition for $X_{min}$ prevents too much graphical space being allocated to samples with analyte levels at or near the $L_{qc}$ (eq 5). Any data above or below these boundaries are graphed at the appropriate boundary. Major and minor decade tick marks are displayed to provide a quantitative frame of reference, but tick-mark labels are not displayed. In these compact summaries, such labels are visually distracting and detract from the display of relationships among the data. Complete (numerical) data and summary statistics are provided to each participant in the interlaboratory study as a separate document.
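The axis range of eqs 11a and 11b, together with the rule that out-of-range data are graphed at the boundary, can be sketched in Python as follows. The function names and the numerical inputs are hypothetical; the expected standard deviations would come from eq 5.

```python
def axis_limits(consensus_values, expected_sds):
    """Eqs 11a and 11b: limits of the logarithmic concentration axis.

    consensus_values[j] is the consensus for sample j and expected_sds[j]
    is the expected among-participant standard deviation S for that sample.
    """
    x_max = max(x + 3.0 * s for x, s in zip(consensus_values, expected_sds))
    x_min = min(consensus_values) * max(consensus_values) / x_max
    return x_min, x_max

def clip_to_axis(value, x_min, x_max):
    """Data beyond either boundary are graphed at that boundary."""
    return max(x_min, min(x_max, value))

# Hypothetical consensus values and expected standard deviations.
x_min, x_max = axis_limits([2.0, 5.0], [0.5, 1.0])   # (1.25, 8.0)
```

Rearranging eq 11b gives Min/X_min = X_max/Max, so on the logarithmic axis the margin below the lowest consensus value equals the margin above the highest one. In the example, 2.0/1.25 and 8.0/5.0 are both 1.6.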

Figure 2. Youden Comparability Plot. Each symbol denotes the participant’s measurement for a given analyte in a given sample plotted against the interlaboratory consensus value. The solid symbols denote measurements of the current samples; the open symbols denote data from the preceding nine studies. The solid lines bound $\pm 1 \times S(\bar{X}_j)$ about the expected equality line; the dotted lines bound $\pm 3 \times S(\bar{X}_j)$ about equality. This comparability plot displays retinol measurements for M2QAP XXXIX (and XXIX-XXXVIII) reported by the CDC.

The horizontal axis is dimensionless, although the width of each box reflects the relative number of fully quantitative values reported. The relative position of and the spacing among the sample boxes is arbitrary but is constrained by graphical aesthetics, resolution, and relationships among the samples. The location of the vertical concentration axis along the horizontal axis is often used to visually separate unrelated samples. The among-sample relationships thus far addressed are short- and long-term sample replication, dilution series (a given base serum pool diluted with known volumes of delipidized serum1), and blends (known-volume combinations of different base pools). Inclusion of replicate data from earlier studies provides some insight into the stability of the samples distributed, as well as into the stability of the participant’s measurement processes.

Youden Comparability Plots: While the summary box plot can suggest functional differences in measurement performance related to analyte level, the samples distributed in a given study often do not cover the full range of analytical interest. Further, since the box plot is focused on the current study’s samples, it provides little insight into the participant’s measurement comparability over time. Figure 2 presents a Youden comparability plot that directly explores the relationship between comparability and concentration for the data displayed in Figure 1 plus that participant’s data for the nine immediately preceding studies. Comparability plots typically present data for one analyte, although closely related analytes (such as trans-β-carotene and total β-carotene20) can be compactly and efficiently presented as an average. The participant’s $x_{ijl}$ measurements (vertical axis) are plotted as a function of the interlaboratory $\bar{X}_j$ (horizontal axis), along with lines demarcating $\pm 3 \times$ and $\pm 1 \times$ the expected $S(\bar{X}_j)$. Discordance manifests as asymmetric clustering of the data above or below the central lines; apparent imprecision manifests as symmetrical scatter near or beyond the outer lines. A functional dependence of systematic and/or random-measurement incomparabilities on analyte concentration manifests as a progressive change in the scatter pattern with concentration. Like the box plot concentration axis, both axes are logarithmic, span the concentration range defined by eqs 11a and 11b (albeit with $N_S$ equal to the number of samples analyzed in all ten studies), and display major and minor decade tick marks without labels.

The data from earlier M2QAP studies are displayed as open symbols, allowing identification of “past” and “present” measurement performance. The number of immediately preceding studies chosen for display is a function of the number of data, the average concentration range spanned per sample set, and the need to limit the display to reasonably relevant measurement performance. Four samples are now distributed per typical M2QAP study; we find that the 40 samples provided by 10 studies cover the concentration range of interest for most analytes without great loss of individuality at the graphical resolution employed. While “relevance” is participant-specific, NIST analysts suggest that “about three years” is as long as anyone should be haunted by the ghosts of interlaboratory comparisons past.

(20) Sharpless, K. E.; Duewer, D. L. Anal. Chem. 1995, 67, 4416-4422.

Figure 3. Concordance/Apparent Precision “Target” Plot. The horizontal axis represents concordance ($C_l$, eq 3); the vertical axis represents apparent precision ($AP_l$, eq 10). Perfect comparability is at the bottom center of the target. The dotted semicircular lines are at 1×, 2×, and 3 × $S_j$. The open circle denotes a particular ($C_l$, $AP_l$) for a given analyte in a given study, the open diamonds denote NIST analysts’ performance, and the + signs denote the performance of all other participants. The solid circles to the left denote the participant’s $AP_l$ in the previous nine studies, plotted in time sequence from left to right. The solid circles at the top similarly denote the participant’s $C_l$ in the previous nine studies, plotted in time sequence from top to bottom. Historical ($C_l$, $AP_l$) estimates from sequential studies are connected with light solid lines; intermittent participation is apparent when neighboring symbols are not connected. This target plot displays retinol measurements for M2QAP XXXIX (and XXIX-XXXVIII) reported by the CDC.

Concordance/Apparent Precision “Target” Plots: Both box and comparability plots qualitatively indicate systematic and random variation about the consensus values; neither does so quantitatively. Figure 3 details $C_l$ and $AP_l$ for the data displayed in Figures 1 and 2; for convenience, we often refer to this “Concordance/Apparent Precision” tool as a “target plot.” Concordance is plotted along a horizontal axis spanning $-Q_\infty$ to $+Q_\infty$; apparent precision is plotted along a vertical axis ranging from 0 to $Q_\infty$. The distance of the point defined by $\{C_l, AP_l\}$ from the origin

$$\Delta_l = \sqrt{C_l^2 + AP_l^2} \qquad (12)$$

provides an overall estimate of measurement quality, with the semicircles at radii 1, 2, ..., $Q_\infty$ providing the visual ruler. Measurements performed by all participants with ($C_l$, $AP_l$) within the radius-1 half-ring are expected (on average) to be “exceptionally” comparable to consensus results, those between radius-1 and radius-2 to be “acceptably” comparable, those between radius-2 and radius-3 to be “marginally” comparable, and those outside the radius-3 half-ring to be “poorly” comparable. The ($C_l$, $AP_l$) pairs for all participants are displayed to enable assessment of relative performance, with the pairs for NIST analysts clearly denoted to enable comparison with known measurement systems.

Each participant’s “relevant” past performance is summarized by displaying the $C_l$ and $AP_l$ estimates from the nine previous studies. The historical $C_l$ are displayed at the very top center of the plot, in descending order from least to most recent. The historical $AP_l$ are displayed at the bottom-left of the plot, in increasing order from least to most recent. Sequential values for both $C_l$ and $AP_l$ are connected with lines, enabling visualization of any periods of nonparticipation.

Youden Precision Plots: Target plots are constructed using $AP_l$, enabling all participants to visualize their performance in every interlaboratory study they participate in regardless of previous participation. From an “outsider’s” perspective, it is immaterial if a large $AP_l$ is due to short-term measurement imprecision or to sample-specific discordance: in neither case will simple mathematical adjustment improve that laboratory’s comparability. From the participant’s perspective, however, identifying the cause(s) of apparent imprecision facilitates identification of the measurement-system change(s) needed to improve comparability; the modifications required to address specific interferences are quite different from those required to improve sample-independent measurement precision.
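Returning to the target plot, the overall distance of eq 12 and the four radius bands described for it can be sketched in Python. The function names are invented, and the treatment of points lying exactly on a semicircle (counted here with the inner band) is an assumption, since the text does not say which band a boundary point belongs to.

```python
import math

def target_distance(c_l, ap_l):
    """Eq 12: distance of the (C_l, AP_l) point from the target origin."""
    return math.hypot(c_l, ap_l)

def comparability_band(distance):
    """Map a target-plot distance to the qualitative band used in the text.

    Points exactly on a semicircle are assigned to the inner band; that
    tie-breaking rule is an assumption of this sketch.
    """
    if distance <= 1.0:
        return "exceptionally"
    if distance <= 2.0:
        return "acceptably"
    if distance <= 3.0:
        return "marginally"
    return "poorly"

# Hypothetical participant: slightly low concordance, modest apparent precision.
band = comparability_band(target_distance(-0.4, 0.9))
```

Because the sign of the concordance drops out of eq 12, a participant reading consistently high and one reading consistently low by the same standardized amount land in the same band, but on opposite sides of the plot.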
Figure 4 presents examples of Youden precision plots for the data presented in Figure 1, intended to help evaluate the nature of the various precision estimates that can be calculated for a particular suite of samples. Like box plots, the exact structure of these precision plots depends on the experimental design. When none of the samples distributed are replicates, only APl can be calculated; hence, no precision plot can be generated. If blind replicate samples are included (here, serum 228 was distributed as blind replicates 223 and 225 in the immediately preceding M2QAP study), it may be possible to make 3-way contrasts among APl, Pshort , and Plong . When a replicate sample known to be l l analytically challenging is distributed, it may be possible to contrast APl calculated for all samples with that calculated just for “typical” samples and with that calculated just for the challenge sample. Whenever two or more precision estimates can be calculated, the basic structure of the precision plot is the same: one of the

Figure 4. Youden Precision Component Plots. Three different precision estimates are contrasted: Plshort as a function of APl, Pllong as a function APl, and Pllong as a function of Plshort. Each of the small plots has identical structure: both axes span 0 to 3 × Sj, a line of unit slope denotes equality between estimates, the open circle denotes the participant’s precision estimates, the open diamonds denote data for NIST analysts, and the + signs denote data for all other participants. This precision plot displays retinol measurements for M2QAP XXXIX reported by the CDC.

precision estimates is plotted as a function of another. We display the “most reliable” (largest number of samples included) precision estimate along the horizontal axis (generally, APl has the most and Plshort the least). Since all the precision estimates are constrained to lie between zero and Q∞ in units of Sj, the scale and labeling of the horizontal and vertical axes are identical and are the same for all precision-component plots. A single line of unit slope diagonally bisecting the plot area provides a reference frame. As with target plots, the participant’s measurement performance is placed into context by displaying precision pairs for all participants. The precision estimates for NIST analysts are specially marked to enable comparison with known measurement systems.

EXAMPLES AND DISCUSSION

Figures 1-4 display retinol measurements from M2QAP interlaboratory comparison exercise XXXIX reported by the Nutritional Biochemistry Branch of the Centers for Disease Control and Prevention (CDC). Two of the four samples distributed in this March 1997 study were previously distributed: serum 227 was distributed as 202 in September 1994 and as 210 in June 1995, and serum 228 was distributed as blind replicates 223 and 225 in September 1996. Both Plshort and Pllong can be estimated from these replicated samples, although Plshort reflects September 1996 performance. All of the CDC values were within the central 50% of reported values, with the marginal exception of the low-retinol serum 229. Examination of the comparability plot indicates that the CDC values are in excellent agreement with the interlaboratory medians over all but the very lowest concentrations


Figure 5. Box, comparability, target, and precision plots for the retinol measurements for M2QAP XXXI reported by an academic nutrition research laboratory. Serum 196 is a repeat distribution of 183; sera 198, 195, and 197 are a factor-of-two dilution series prepared by combining a “normal” serum pool with delipidized serum.
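The factor-of-two spacing of the dilution series follows directly from the mixing volumes given in the accompanying discussion (serum 195 from equal volumes of the serum 198 pool and delipidized serum; serum 197 from 1 volume of pool and 3 volumes of delipidized serum). The arithmetic can be sketched as below, assuming that the delipidized serum contributes a negligible amount of retinol; that assumption and the helper name are introduced here and are not stated in the text.

```python
from fractions import Fraction


def pool_fraction(pool_volumes, diluent_volumes):
    """Fraction of the final mixture that comes from the "normal" serum pool.

    Assumption: the delipidized serum adds no retinol, so the expected
    retinol level scales with this fraction.
    """
    return Fraction(pool_volumes, pool_volumes + diluent_volumes)


# Mixing volumes (pool, delipidized serum) as described for M2QAP XXXI.
series = {
    "198": pool_fraction(1, 0),  # undiluted pool
    "195": pool_fraction(1, 1),  # equal volumes
    "197": pool_fraction(1, 3),  # 1 volume pool + 3 volumes delipidized serum
}

# Ratio between successive members of the series.
steps = [series["198"] / series["195"], series["195"] / series["197"]]
```

Under that assumption the expected relative retinol levels are 1, 1/2, and 1/4 of the pool value, with each step a factor of two, while the lipid content falls by the same factors.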

examined. The target plot and precision plots confirm that the CDC maintains superb control over all concordance and precision components. It is our belief that this excellent comparability reflects the CDC’s rigorous implementation of formal quality assurance/control programs, including internal programs using serum pools that span the range of expected values for their analytes of interest as well as participation in external programs such as the M2QAP.

Figure 5 displays box, comparability, target, and precision plots for the retinol data of M2QAP XXXI reported by an academic micronutrients research laboratory. Only one of the four sera used in this June 1994 study had been previously distributed: serum 196 was previously distributed as 183 in June 1993. The other three sera are a factor-of-two dilution series, with serum 195 having been prepared from equal volumes of serum 198 pool and delipidized serum and serum 197 prepared from 1 volume of serum 198 and 3 volumes of delipidized serum. All values except for the unusually hypolipidic, low-retinol serum 197 agree well with the median. The comparability plot indicates that (at least for one previous serum) low-retinol serum is not itself the source of this incomparability. While based upon different samples, the

large APl seen in both the target and precision plots indicate possible serum lipid dependence in this participant’s retinol measurement system. Subsequent distributions of similarly hypolipidic sera tend to confirm this hypothesis, although the specific relationships among this participant’s measurements, the interlaboratory medians, and truth for these abnormal samples remain to be established.

Figure 6 displays box, comparability, target, and precision plots for the retinol data of XXXVI reported by a different academic nutrition laboratory. All four of the sera distributed in this March 1996 study were previously distributed, although this participant had not participated in one of these previous studies. Serum 215 was distributed as the blind replicates 170 and 172 in September 1992, serum 216 was distributed as the blind replicates 182 and 185 in June 1993, serum 217 was distributed as 194 in June 1994, and serum 218 was distributed as 192 in March 1994 and 199 in September 1994. All four of this participant’s retinol values for XXXVI are lower than expected, although most of the previous values had been within the central 50%. The comparability plot suggests that there is no functional difference in measurement performance with retinol concentration, but does indicate a less

Figure 6. Box, comparability, target, and precision plots for the retinol measurements for M2QAP XXXVI reported by an academic research laboratory. Serum 215 is a repeat distribution of the blind duplicates 170 and 172, serum 216 is a repeat distribution of the blind duplicates 182 and 185, serum 217 is a repeat of 194, and serum 218 is a repeat of 192 and 199.

stable measurement system than achieved by the participants discussed above. The target plot indicates a generally excellent APl but a periodically varying and large Cl; the precision plot suggests that both Plshort and Pllong are less well controlled than APl. The larger than expected Plshort was observed early in this laboratory’s participation and thus may not properly represent their XXXVI measurement characteristics. We believe that the pattern of good precision but sample-independent discordance may be related to different analysts and/or calibration solutions.

Figure 7 displays box, comparability, target, and precision plots for the retinol data of XXXV reported by a contract-analysis laboratory. All four of the sera distributed in this September 1995 study had been previously distributed: serum 213 is an abnormally high concentration retinol sample first distributed as 191 in March 1994, and sera 212, 214, and 211 are the factor-of-two dilution series first distributed in June 1994. The reproducible underestimation of retinol in the very high concentration sample (roughly 6 times the upper limit of the expected range20) is characteristic of linear extrapolation from a calibration function approaching saturation. The

reproducible overestimation of retinol in the low-concentration hypolipidic sera 211 and 197 may again indicate a serum lipid dependence; however, the comparability plot suggests that this participant often (but not always) overestimates retinol in the low-normal range. The inconsistent results for the high-normal sera 212 and 198 may be an isolated problem; however, both Cl and APl for this participant are typically, if erratically, large. We believe that the observed measurement incomparabilities reflect a measurement system that (1) is optimized for “normal” samples, (2) is unusually sensitive to modifications of native serum matrixes, and (3) may have less stringent data-quality goals than some other participants. The retinol underestimation in the very high concentration challenge samples has little clinical consequence: whether reported as 4, 5, or 6 times normal or just “very high!”, such an extreme value should prompt the same urgent response. The precision plot shown for XXXV contrasts the relationships between APl and several different definitions of Pllong: one based on all four samples (“All, Long-term”), one based just on the three


Figure 7. Box, comparability, target, and precision plots for the retinol measurements for M2QAP XXXV reported by a contract-analysis laboratory. Serum 213 is a repeat distribution of 191; sera 212, 214, and 211 repeat the dilution series 198, 195, and 197 first distributed in XXXI.

dilution series samples (“Series, Long-term”), and one just for the challenge serum (“High, Long-term”). The “All” and “Series” relationships are quite similar for most participants; however, at least four participants have very high APl only when the challenge sample is included in the calculation. We strongly suspect that the measurement systems used in these laboratories incorrectly, but inconsequentially, extrapolate beyond the domain of their calibration functions. The participant discussed in Figure 5 is one of the two participants that have large APl but small Pllong using just the dilution series samples; the other participant with similar (APl,Pllong) data is also consistently discordant given hypolipidic serum. We suspect, but have as yet been unable to verify, that these two laboratories use similar measurement systems.
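The link drawn above between calibration extrapolation and underestimation of the very high concentration challenge serum can be illustrated numerically. The sketch below is not any participant's measurement system: the saturating response function, its parameters, and the calibrant levels are all hypothetical, chosen only so that the response is nearly linear over the "normal" range. Concentrations are expressed as multiples of the upper limit of the expected range.

```python
def response(conc, r_max=20.0, k=20.0):
    """Hypothetical detector response that saturates at high concentration."""
    return r_max * conc / (k + conc)


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Calibrants confined to the "normal" range (hypothetical levels).
calibrants = [0.25, 0.5, 1.0]
slope, intercept = fit_line(calibrants, [response(c) for c in calibrants])


def reported(true_conc):
    """Concentration obtained by inverting the linear calibration."""
    return (response(true_conc) - intercept) / slope


normal_ratio = reported(1.0) / 1.0  # within the calibrated range
high_ratio = reported(6.0) / 6.0    # about 6 times the upper limit
```

With these invented parameters the linear calibration recovers in-range samples closely, while the sample at 6 times the upper limit is reported noticeably below its true value, which is the direction of error described for the challenge serum.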


ACKNOWLEDGMENT

All M2QAP activities through FY 1997 were supported in part by the Division of Cancer Prevention and Control, National Cancer Institute, National Institutes of Health. We thank all of the analysts who have participated in the M2QAP studies and workshops. Their participation, enthusiastic support, and critical feedback are crucial to the continued improvement of interlaboratory measurement comparability.

Received for review September 28, 1998. Accepted February 6, 1999. AC981074K