NIST Micronutrients Measurement Quality Assurance Program

Anal. Chem. 2000, 72, 3611-3619

NIST Micronutrients Measurement Quality Assurance Program: Characterizing Individual Participant Measurement Performance over Time

David L. Duewer,* Margaret C. Kline, Katherine E. Sharpless, and Jeanice Brown Thomas
Chemical Science and Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland 20899

Maria Stacewicz-Sapuntzakis
Department of Human Nutrition and Dietetics, The University of Illinois at Chicago, Chicago, Illinois 60612

Anne L. Sowell
Nutritional Biochemistry Branch, Centers for Disease Control and Prevention, Atlanta, Georgia 30341

The mission of the Micronutrients Measurement Quality Assurance Program (M2QAP) at the National Institute of Standards and Technology is enhanced interlaboratory measurement comparability for fat-soluble vitamin-related measurands in human serum. We recently described improved tools for evaluating individual participant measurement performance in single interlaboratory comparison exercises; we here apply and extend these tools to the evaluation of participant performance over the entire 15-year history of the M2QAP. We describe and illustrate a set of interconnected graphical reporting tools for identifying long-term trends and single-exercise events. We document and discuss recurrent patterns we observe in the measurement performance characteristics for M2QAP participants. The graphical analysis techniques utilized may be applicable to other interlaboratory comparison programs.

In 1984, the Micronutrients Measurement Quality Assurance Program (M2QAP) was established at what is now the National Institute of Standards and Technology (NIST). The primary goal of the M2QAP is enhanced among-participant comparability of vitamin-related measurements in human serum. The basic vehicle for both evaluating and influencing the state-of-the-measurement art is the interlaboratory comparison exercise. Through mid-1999, 42 M2QAP interlaboratory comparison exercises have focused on the measurement of total β-carotene, α-tocopherol, total or trans-retinol, and other fat-soluble vitamin-related measurands.1 While the primary motivation (clinical, epidemiological, nutritional), nature (government, private sector, university), and international composition of the participant community have changed over time, the evolution of the community has been relatively gradual.1 The number of participants in each exercise (defined by the return of results, not receipt of samples) has ranged from 19 to 56 with a median of 43, a sufficiently large number to enable meaningful data analysis.

* Corresponding author: (tel) 301-975-3935; (fax) 301-977-0685; (e-mail) [email protected].
(1) Duewer, D. L.; Brown Thomas, J.; Kline, M. C.; MacCrehan, W. A.; Schaffer, R.; Sharpless, K. E.; May, W. E.; Crowell, J. A. Anal. Chem. 1997, 69, 1406-1413.
10.1021/ac991481b CCC: $19.00 © 2000 American Chemical Society. Published on Web 06/14/2000

We recently described graphical analysis tools developed to help M2QAP participants evaluate their measurement performance for a given exercise relative to both the other participants’ results and their own “recent” (about three years) measurement history.2 To facilitate translation of numerical values into possible chemical implications, these graphical tools focus on measurement comparability and two of its constituent statistical attributes: concordance and apparent precision.2 Comparability is the multiparticipant analogue of measurement accuracy: in M2QAP studies, the comparability of a single measurement is defined as its difference from the consensus result, normalized to the among-participant measurement variability. Concordance is the multisample, multiparticipant analogue of “bias” or “trueness”; it is defined as the average comparability for a given participant across all similar-matrix samples in a given interlaboratory exercise. Apparent precision is likewise a statistical property of multisample, multiparticipant measurements; it is defined as the standard deviation of the participant’s comparabilities for all samples of an exercise. While apparent precision includes all of the random variance components comprising “precision” for multiple measurements made on single samples, it also includes systematic differences in measurement system response arising from differences in sample composition.

A surprising majority of the thus-far 111 M2QAP participants have reported data in a sufficient number of exercises to establish characteristic “signatures” of concordance and/or apparent precision.
Two participants have reported data in all of the 42 fat-soluble vitamin exercises and for all 170 samples distributed. (The median participation rate is 14 exercises or roughly five years of measurement characterization. The median number of samples analyzed during this period is 47.)

We now describe graphical tools developed to display and characterize these signatures. We find that there are a number of representative themes in the signature patterns that may be diagnostic of particular sources of measurement incomparability. We believe that enabling interlaboratory comparison program participants to more fully evaluate their own measurement performance will help some participants enhance their measurement systems. For at least some of the M2QAP measurands, improved within-participant measurement system control appears necessary for improving among-participant measurement comparability. We hope to further the M2QAP’s major goal of improved comparability by periodically focusing each participant’s attention upon the concordance and apparent precision aspects of their truly long-term measurement performance. Graphical performance summaries, as described below, were provided to all 1999 M2QAP participants.

(2) Duewer, D. L.; Kline, M. C.; Sharpless, K. E.; Brown Thomas, J.; Gary, K. T.; Sowell, A. L. Anal. Chem. 1999, 71, 1870-1878.

Analytical Chemistry, Vol. 72, No. 15, August 1, 2000 3611

DEFINITIONS

The M2QAP conceptual tools for characterizing interlaboratory comparison measurement performance are discussed in detail elsewhere.2 The following definitions are essential to the current discussion.

Measurement Performance Characteristics. The standardized measurement comparability for a particular measurand X in a given sample j distributed in a given interlaboratory comparison exercise k reported by a given participant l is defined as

$$\Delta_{jkl} = \frac{X_{jkl} - \bar{X}_{jk}}{S(\bar{X}_{jk})} \qquad (1)$$

where Xjkl is the measured concentration, X̄jk is the consensus concentration, and S(X̄jk) is the among-participant measurement standard deviation expected for the given concentration. The measurement concordance for the participant is the average standardized comparability for every sample distributed in the exercise

$$C_{kl} = \sum_{j=1}^{N_k} \Delta_{jkl} / N_k \qquad (2)$$

where Nk is the number of samples analyzed. The apparent precision is the standard deviation of the comparabilities

$$AP_{kl} = \sqrt{\sum_{j=1}^{N_k} (\Delta_{jkl} - C_{kl})^2 / (N_k - 1)} \qquad (3)$$

The deviation from consensus expected for the participant’s measurement process at the time the interlaboratory samples were analyzed is the square root of the summed squares of the concordance and apparent precision

$$D_{kl} = \sqrt{C_{kl}^2 + AP_{kl}^2} \qquad (4)$$
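Equations 1-4 can be sketched in a few lines of Python. This is a minimal illustration rather than the M2QAP's own software: the function name `exercise_metrics` and the four-sample input values are hypothetical, and the consensus and expected standard deviation for each sample are assumed to be supplied by the caller.

```python
import numpy as np

def exercise_metrics(measured, consensus, expected_sd):
    """Return (comparabilities, C, AP, D) for one participant in one exercise.

    Hypothetical helper: consensus values and expected among-participant
    standard deviations are taken as given inputs.
    """
    measured = np.asarray(measured, dtype=float)
    consensus = np.asarray(consensus, dtype=float)
    expected_sd = np.asarray(expected_sd, dtype=float)

    delta = (measured - consensus) / expected_sd  # eq 1: comparability per sample
    c = delta.mean()                              # eq 2: concordance
    ap = delta.std(ddof=1)                        # eq 3: apparent precision (N_k - 1)
    d = np.hypot(c, ap)                           # eq 4: expected deviation
    return delta, c, ap, d

# Invented values for a four-sample exercise, for illustration only.
delta, c, ap, d = exercise_metrics(
    measured=[0.52, 0.30, 0.75, 0.41],
    consensus=[0.50, 0.32, 0.70, 0.40],
    expected_sd=[0.05, 0.04, 0.06, 0.05],
)
print(delta, c, ap, d)
```

Because eq 3 divides by Nk - 1, the sketch needs at least two samples per exercise to return a finite apparent precision.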

The M2QAP uses robust quartile-based statistics to estimate observed consensus and among-participant standard deviation values.3 The expected standard deviation for a given concentration is calculated from an empirically parametrized functional relationship between these observed consensus and standard deviation values.1 We have recently parametrized the functional relationships using just the data from the most recent 10 interlaboratory exercises, conducted from February 1996 through May 1999. Measurement comparability, concordance, apparent precision, and expected deviation thus are dimensionless quantities expressed relative to the “expected among-participant measurement standard deviation characteristic of M2QAP participants during the time period 1996 through 1999”. We abbreviate this unit as “SD96:99”.

Measurands. Three “analytes” have been routinely reported in all of the M2QAP interlaboratory exercises: α-tocopherol, total β-carotene, and total or trans-retinol. These analytes represent different types of measured quantities or “measurands”: single entity, well-defined composite entity, and multiple entity. Since no M2QAP participant thus far differentiates optical isomers, α-tocopherol nominally designates a single chemical entity. By long-established policy, total β-carotene explicitly designates the sum of all β-carotene geometrical isomer (trans, cis, di-cis, etc.) concentrations. During the period reported in this study, the M2QAP did not similarly differentiate between total and trans-retinol. Different participants are now known to define “retinol” differently, including an unknown fraction (none to all, depending on their own policy and chromatographic resolution) of cis-retinol isomers along with trans-retinol. While “retinol” thus is multiply defined across the measurement community, each individual participant’s definition may be consistent over time.

GRAPHICAL TOOLS

Single Measurands. Time Series Plot.
Value versus time plots enable both the qualitative recognition of events and trends and at least semiquantitative determination of location and dispersion characteristics at any given time or period. Figure 1 displays the comparability, concordance, apparent precision, and expected deviation for total β-carotene measurements reported by the Department of Human Nutrition and Dietetics, The University of Illinois at Chicago (UIC), versus the interlaboratory exercise completion date. This participant has analyzed and reported values for all 170 samples distributed in all 42 exercises in the entire 15-year history of the M2QAP.

The lower section of the time series presents the comparability for every sample analyzed and the concordance for every comparison exercise for which results were reported. Both of these performance characteristics may, in principle, range from -∞ to +∞; however, standardizing the difference between the measured and the consensus value to the expected among-participant standard deviation ensures that values much smaller than -3 and larger than +3 are uncommon. Displaying these characteristics at a scale of -3 to +3 provides adequate graphical resolution; plotting any off-scale data at the axis ensures that little or no information is lost. Participant UIC’s 1984 through 1987 total β-carotene comparability averaged about -2 × SD96:99, rapidly improved to less than 1 × SD96:99 below consensus during 1988 through 1991, and further improved to virtual identity with the consensus from 1992 through the most recent exercise.

(3) Analytical Methods Committee, Royal Society of Chemistry. Analyst 1989, 114, 1693-1697.
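The display convention described above (a fixed -3 to +3 scale, with off-scale points drawn at the axis) can be sketched as follows. The helper name `clip_for_display` and the input values are hypothetical. The tail probability is included only to illustrate why values beyond ±3 are uncommon, under the added assumption that comparabilities are roughly standard normal.

```python
import math
import numpy as np

def clip_for_display(values, lo=-3.0, hi=3.0):
    """Pin off-scale values to the axis limits and flag which ones were pinned."""
    values = np.asarray(values, dtype=float)
    off_scale = (values < lo) | (values > hi)
    return np.clip(values, lo, hi), off_scale

# Invented comparabilities; the first and last fall outside the plotting range.
shown, off_scale = clip_for_display([-4.2, -1.0, 0.3, 3.7])

# ASSUMPTION: standard normal comparabilities. Two-sided probability beyond +/-3.
tail_beyond_3 = math.erfc(3 / math.sqrt(2))
print(shown, off_scale, round(tail_beyond_3, 4))
```

For the apparent precision and expected deviation panel, the same helper applies with `lo=0.0`.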

Figure 1. Time-series plot for total β-carotene measurements reported by the Department of Human Nutrition and Dietetics, The University of Illinois at Chicago. The lower section displays comparabilities (∆jkl, solid circles for early measurements and open circles for recent measurements) for 170 samples and concordances (Ckl, dark line) for 42 interlaboratory exercises. The upper section displays apparent precisions (APkl, dark line) and expected deviations (Dkl, light line) for each exercise. The distinction between early and recent measurements serves to visually interconnect this time-series plot with the following graphical tools.

While apparent precision and expected deviation can be qualitatively assessed from the average location and range of comparabilities at a given date, the upper section of each time series plot provides a more quantitative analysis. The minimum allowed value for both apparent precision and expected deviation is zero; as above, values larger than +3 are uncommon. Displaying these characteristics at the same scale of 0 to +3 and the same graphical resolution as used for the lower section facilitates comparison of the relative contributions of concordance and apparent precision to the expected deviation. Again, plotting any apparent precision and expected deviation larger than 3 at the axis limit minimizes information loss. Participant UIC’s apparent precisions are uniformly excellent, averaging less than 0.5 × SD96:99 over all exercises. The moderate to slight discordance of the earliest measurements dominates expected deviation through 1992; since that time, apparent imprecision has dominated.

Ogive Plot. If a given group of comparabilities for a given participant can be assumed to represent random draws from a stationary measurement process, then the distribution of those comparabilities identifies the probability distribution characteristic of that process. Such a quantitative model for the measurement process can be a powerful aid to identifying and learning from truly exceptional events, a critical if implicit component of any continuous quality improvement program.4 Ordering a set of values from smallest to largest and plotting them against their rank displays the empirical cumulative distribution or “ogive” of the set.5 Figure 2 displays ogives for two sets of the total β-carotene comparability data of Figure 1, one for the 132 samples distributed in the first 32 M2QAP exercises and the other for the 38 samples distributed in the 10 most recent exercises.

(4) Roberts, G. W., Ed. Quality Planning, Control, and Improvement in Research and Development; Marcel Dekker: New York, 1995.
(5) Schmid, C. F.; Schmid, S. E. Handbook of Graphic Presentation, 2nd ed.; Wiley: New York, 1979.

Figure 2. Ogive plot for the total β-carotene measurements of Figure 1. The solid circles denote comparabilities (∆jkl) for the 132 samples distributed in the 32 early exercises, ordered from smallest to largest and plotted against their probability-scaled rank (PSRank) in the ordering. The open circles denote comparabilities for the 38 samples distributed in the 10 most recent exercises, again ordered from smallest to largest and plotted against their PSRank. The three vertical lines crossing the zero-comparability reference line denote PSRank values of 0.25, 0.50, and 0.75: the approximate division points for the first, second (median), and third quartiles of the ordered values. The dark line denotes the cumulative distribution or “ogive” expected for a normally distributed population having mean µe and standard deviation σe, with the numeric values listed to the left side of the plot. The light line likewise denotes the ogive for a normally distributed population having mean µr and standard deviation σr, listed to the right side of the plot.

Direct comparison of the two sets is made possible by converting the separate rank order values to a common 0-1 probability scale:

$$\mathrm{PSRank}_j = \mathrm{Rank}_j / (n + 1) \qquad (5)$$

where n is the number of samples and Rankj is the rank of sample j (1 to n) in each smallest-to-largest ordered set.

This ogive plot can be used to evaluate any postulated probability distribution. Most of the M2QAP comparabilities thus far examined are “normally distributed” once allowance is made for a minority fraction of exceptional values. Figure 2 also displays ogives for normal distributions having the mean, µ, and standard deviation, σ, empirically estimated from the observed comparabilities using robust quartile-based statistics.3 If the observed and empirically parametrized normal ogives agree well, then µ describes the expected discordance and σ the apparent imprecision of the measurement process over the time period evaluated. The distribution σ also includes variance from any changes in the measurement process during the relevant time period.

Participant UIC’s “early” comparabilities are well described as a normal distribution with an expected discordance, µe, of -0.64 × SD96:99 and expected apparent precision, σe, of 0.69 × SD96:99. There are a few exceptionally negative comparabilities (left side of the plot) traceable (by comparison to the time series plot of Figure 1) to systematically smaller-than-consensus measurements in 1985 through 1986 exercises. Likewise, the recent comparabilities are well described as a normal distribution with expected discordance, µr, of only -0.08 × SD96:99 and expected apparent precision, σr, of 0.28 × SD96:99. Unlike the early data, there are perhaps eight exceptionally positive comparabilities (right side of the plot) scattered throughout many of the recent exercises.

Figure 3. Retrospective concordance/apparent precision “target” plot for the total β-carotene measurements of Figure 1. The semicircles bound the possible combinations of concordance (Ckl) and apparent precision (APkl) that would result in expected deviation (Dkl) values of 1, 2, and 3. The solid circles denote the (Ckl, APkl) pairs for 32 early exercises. The open circles denote the (Ckl, APkl) pairs for 10 most recent exercises.

Retrospective Concordance/Apparent Precision Plot. The concordance/apparent precision or “target” plot was developed to summarize a participant’s measurement performance in a given interlaboratory exercise relative to that of the participant community (and to NIST analysts).2 In a similar manner, Figure 3 displays the characteristic (Ckl, APkl) pairs for the total β-carotene data of Figure 1, providing a compact retrospective summary of measurement performance that focuses on the question, “How well have we done?” While the time-series tool provides this information, the level of detail can be distracting. The retrospective “target” plot efficiently displays the marginal distributions of concordance and apparent precision (the overall distribution without regard to location along the time axis) and their joint distribution (apparent precision as a function of concordance), as well as displaying the expected deviation (distance from the center of the target). As in the ogive plot, dividing the (Ckl, APkl) pairs into visually distinguishable early and recent sets enables a “then” and “now” comparison. Participant UIC’s total β-carotene measurements were “off target” (outside the Dkl = 3 semicircle) in only two or three of the early exercises; in all recent exercises, UIC’s measurements have been very much “on target” (inside the Dkl = 1 semicircle).

Measurement Performance Summary Display. The above three plots are designed to be displayed as a single interlocked unit, with each element presented at the same graphical scale. Figure 4 presents such measurement performance summary displays for α-tocopherol and retinol measurements reported by the same participant and for the same time period as the total β-carotene data displayed in Figures 1-3.
While not quite as outstanding as the total β-carotene measurements, nearly all of participant UIC’s α-tocopherol and retinol measurements have been “on target.” The α-tocopherol measurement characteristics follow the same general pattern as those for total β-carotene. The retinol measurements display a different pattern, with a period of increased discordance and/or decreased apparent precision from 1992 through 1996. There are few if any clearly exceptional comparabilities for either measurand, although all retinol measurements in one 1995 exercise are more than 2 × SD96:99 below consensus.
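The numbers behind the ogive and target panels of a summary display can be sketched as below. Several pieces are assumptions rather than the M2QAP's documented procedure: the function names are hypothetical, the "robust quartile-based" estimates are taken here to be the median for µ and the interquartile range divided by 1.349 for σ, and the zone labels simply name the D = 1, 2, 3 semicircles of the target plot.

```python
import numpy as np
from statistics import NormalDist

def ps_rank(values):
    """Sort values and pair each with its probability-scaled rank, Rank/(n + 1)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = ordered.size
    return ordered, np.arange(1, n + 1) / (n + 1)

def quartile_normal_fit(values):
    """ASSUMPTION: mu = median, sigma = IQR / 1.349 (IQR of a unit normal)."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    return med, (q3 - q1) / 1.349

def normal_ogive(mu, sigma, ps):
    """Comparability a normal(mu, sigma) population would show at each PSRank."""
    dist = NormalDist(mu, sigma)  # requires sigma > 0
    return np.array([dist.inv_cdf(p) for p in ps])

def target_zone(c, ap):
    """Name the smallest semicircle (D = 1, 2, or 3) containing the (C, AP) pair."""
    d = float(np.hypot(c, ap))
    for limit in (1, 2, 3):
        if d <= limit:
            return d, f"inside D={limit}"
    return d, "outside D=3"

# Invented comparabilities for illustration only.
comparabilities = [-0.9, -0.3, 0.1, 0.2, 0.6, 1.4, -0.1]
ordered, ps = ps_rank(comparabilities)
mu, sigma = quartile_normal_fit(comparabilities)
model = normal_ogive(mu, sigma, ps)
d, zone = target_zone(-2.5, 2.0)
print(ps, mu, sigma, zone)
```

Plotting `ordered` and `model` against `ps` gives the observed and fitted ogives; points that depart from the fitted line at either end are candidates for the exceptional values discussed above.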

Multiple Measurands. Stacked Summary Displays. Stacking the measurement performance summary displays so that the time-series plots can be easily compared allows ready discrimination among measurand-specific, sample-specific, and exercise-specific events. As noted above, participant UIC’s α-tocopherol and total β-carotene measurements show the same general pattern of excellent apparent precision and stepwise changes in concordance. Since the plots are not stacked together, it requires some effort to trace the similarities and differences. The contrasting behavior of the α-tocopherol and retinol characteristics in Figure 4 is more readily perceived. The largest discordance “spikes” for retinol do not align with the (smaller) spikes for the other two measurands.

Comparability Multiplots. While visual intercomparison of time-series plots for two or more measurands is expected to be a powerful tool for interpreting measurement performance, such analysis is by nature qualitative and inevitably subjective. Given that 14 measurands have been reported by at least 10 participants for at least the last 10 exercises, the number of visual evaluations possible may also become unwieldy. Displaying comparabilities for two or more measurands determined in the same samples as a multiple scatterplot (“multiplot”), the graphical analogue of a correlation matrix, provides a more quantitative and less fatiguing way to identify the more potentially interesting interactions. Figure 5 presents such a multiplot for the total β-carotene, α-tocopherol, and retinol data reported in Figures 1 through 4. The scale for both axes of every cell of a multiplot is again -3 to +3, with all out-of-range comparabilities displayed at the cell margins. Given that from 1 to more than 14 measurands are reported by different M2QAP participants, it is not practical to use the same graphical resolution for every multiplot, although each component of a given multiplot is best displayed at the same resolution.
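Since the multiplot is described as the graphical analogue of a correlation matrix, the numerical counterpart can be sketched directly. The helper name and the data are hypothetical, the Pearson coefficient is an assumption (the text does not name a particular coefficient), and the sketch assumes every measurand was reported for the same samples in the same order.

```python
import numpy as np

def comparability_correlations(delta_by_measurand):
    """Pairwise Pearson correlations of comparabilities across shared samples."""
    names = list(delta_by_measurand)
    data = np.array([delta_by_measurand[name] for name in names], dtype=float)
    return names, np.corrcoef(data)  # each row of `data` is one measurand

# Invented comparabilities for six shared samples, for illustration only.
names, r = comparability_correlations({
    "alpha-tocopherol": [-1.2, -0.4, 0.1, 0.6, 1.1, -0.3],
    "total beta-carotene": [-0.9, -0.5, 0.3, 0.4, 1.3, -0.1],
    "retinol": [0.5, -0.2, 0.1, -0.6, 0.0, 0.3],
})
print(names)
print(np.round(r, 2))
```

Computing the matrix separately for early and recent samples mirrors the solid-circle and open-circle split used in the plots.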
Up to at least 10 analytes can be usefully displayed on a single page. Division of the data into early and recent sets again facilitates identification of the most relevant behavior. As anticipated from the qualitative comparison of measurement characteristic patterns, there is little correlation among participant UIC’s recent measurement comparabilities. However, there is significant correlation between α-tocopherol and total β-carotene and possible anticorrelation between α-tocopherol and retinol in the early data.

EXAMPLES AND DISCUSSION

Truly establishing that particular patterns in a participant’s measurement characteristics result from given events requires detailed knowledge of all aspects and history of the participant’s measurement system(s), analyst(s), and operating procedure(s). No external reviewer can have all of the necessary information to establish particular cause-effect relationships; these are best determined by the analysts directly involved with the measurements. However, there are recurrent patterns in the measurement characteristics of our own and other M2QAP participant results. We believe that many of these patterns are “signatures” diagnostic of certain broad classes of measurement system events. We base our interpretations upon evidence gleaned through our unique roles as (1) chemical analyst participants in the M2QAP interlaboratory comparison exercises, (2) producers/providers of the lyophilized serum samples used in each exercise, and (3) data analysts/coordinators responsible for assembling all data from and disseminating summary results to all other participants. Indeed, some sample- and/or exercise-specific events can only be explored through active collaboration among participants and program coordinators.

Figure 4. Measurement performance summary displays for retinol and α-tocopherol measurements reported by the Department of Human Nutrition and Dietetics, The University of Illinois at Chicago. The three segments of each summary display are described in Figures 1-3.

Figure 6 displays performance characteristics for retinol, α-tocopherol, and total β-carotene measurements reported by the Nutritional Biochemistry Branch of the Centers for Disease Control and Prevention (CDC). This participant has analyzed all 150 samples distributed since the beginning of 1987. While the measurements reported by participants UIC and CDC are characterized by the same general pattern of excellent concordance and apparent precision, the temporal locations of discordance and/or apparent imprecision “spikes” are not the same and are related to different measurement events rather than sample anomalies.

Figures 7-9 display archetypal performance characteristics of two research laboratories. Results for only 20 sequential exercises are displayed in these figures to maintain participant anonymity; the data have not been otherwise manipulated. Figure 7 displays characteristics for retinol and α-tocopherol measurements reported by a clinical research center; Figures 8 and 9 display

characteristics for retinol, α-tocopherol, and total β-carotene reported by a university nutrition laboratory.

Long Periods of Uniform Apparent Precision. Extended periods of consistently small-magnitude apparent precision (Figures 1, 4, 6, and 7) suggest that the participant emphasizes short-term (at least) measurement quality control. Measurements made on the same sample within a short time frame are likely to be very similar; i.e., highly repeatable. However, good repeatability can be accompanied by many different patterns of discordance: from the consistency of Figure 6, through the evolutionary to somewhat labile patterns of Figures 1 and 3, to the large-magnitude periodicity of Figure 7.

Long Periods of Uniform Concordance. Extended periods of relatively constant concordance (Figures 1, 4, and 6) suggest that the participant emphasizes measurement process stability. The very stable concordances of Figure 6 reflect the very long-term epidemiological focus and extensive internal and external quality assurance activities of participant CDC. While this participant has made incremental modifications over the past 12 years, the basic measurement protocol has been kept as invariant as practical.

Figure 5. Comparability multiplot for the retinol, α-tocopherol, and total β-carotene measurements presented in Figures 1-4. Each cell of the multiplot displays the joint distribution of comparabilities for a given pair of analytes, displayed on horizontal and vertical axes of span -3 to +3. Each symbol in a cell represents two different measurements reported for a given sample in the same interlaboratory exercise: the solid circles denote early exercise samples; the open circles denote samples distributed in the 10 most recent exercises.

Step Changes in Concordance. Abrupt (less than a year) changes in otherwise stable concordance (Figure 1 and α-tocopherol in Figure 4) are evidence of discrete changes in the measurement process. The step changes in the measurement concordance of participant UIC’s measurands are associated with a number of intentional, multifactor protocol changes. These process improvements include the following: detector wavelength optimization, computerized data collection and analysis, better calibration materials, more reliable methods for determining calibrant purity, and replacement of stainless steel column inlet and outlet frits with biocompatible materials.

Long-Period Concordance Oscillation. Relatively gradual (on the order of one to two years) but large-magnitude (more than 1 × SD96:99) changes in concordance (Figure 7) are likely a signal of uncontrolled factors in the participant’s measurement process. Since the concordance oscillations for retinol and α-tocopherol measurements in Figure 7 are somewhat but not highly correlated, we suspect degradation of calibration materials.

Labile Concordance and Apparent Precision. Rapid, large-magnitude oscillation of both concordance and apparent precision (Figure 8) occurs when many of the measurements for a set of samples are very and inconsistently different from consensus. Given the small number of samples (typically four) distributed per M2QAP exercise, “on average” does not reliably average to zero in such situations. Labile concordance associated with labile apparent precision could result from use of a measurement system with quite different properties from those used by the majority of participants (e.g., normal-phase separation or a system designed for very rapid semiquantitative screening). However, it may also reflect poor measurement repeatability; we observe this general behavior for several academic laboratories where participation in


the M2QAP exercises has been integrated into the graduate training program. While we believe poor repeatability in many cases reflects analyst inexperience, poor repeatability for M2QAP samples does not necessarily imply poor repeatability for “real” measurements on fresh serum; reconstituting lyophilized serum is something of an art in itself.

Apparent Precision “Spikes”. A relatively large apparent imprecision for a single exercise that is not associated with a large discordance (Figure 6, α-tocopherol and total β-carotene) typically occurs when the comparability for one sample is very inconsistent with the comparabilities of the other samples in the set. Such single-sample inconsistent comparabilities can result from many different sample-specific events: misidentified or damaged sample, improperly reconstituted or contaminated sample, improperly prepared sample extract, isolated equipment and/or automation malfunction, transcription error, or sample-specific interference. It is possible to narrow the potential causes for a particular “spike” if the participant reports multiple measurands: sample- or aliquot-wide events should be manifest across measurands. While not included in Figure 6, participant CDC’s total β-carotene apparent precision spike in the second exercise of 1996 is associated with increasingly negative comparability for increasingly polar measurands in one particular serum. This trend in decreasing recovery is compatible with that expected for incomplete extraction.6 It is also possible to better identify events through more detailed analysis of the comparabilities for the other samples in the sets and for measurements on the same serum reported for other exercises. The spike in the third exercise of 1995 in the CDC’s α-tocopherol apparent precision appears to result from a transient (and minor) calibration event.
The transience is documented (using graphical tools developed for this purpose2) by the agreement between the participant’s α-tocopherol measurements and the consensus values for the “problem” serum when distributed in an earlier and a later exercise. The source is suggested by the following: (1) none of the participant’s other measurand values for that serum are similarly different from consensus, (2) the “problem” serum has an unusually low consensus α-tocopherol concentration, and (3) the difference between measured and consensus concentration for α-tocopherol in the four samples distributed in this exercise changes fairly smoothly from negative to positive with increasing consensus concentration.

(6) Duewer, D. L.; Chesler, S. N.; Schönberger, E.; MacCrehan, W. A. Int. Symp. Laboratory Automation Robotics, 1992 Proceedings; Zymark: Hopkinton, MA, 1993; pp 727-741.

Figure 6. Measurement performance summary displays for retinol, α-tocopherol, and total β-carotene measurements reported by the Nutritional Biochemistry Branch, Centers for Disease Control and Prevention.

Concordance “Spikes”. A relatively large change in a participant’s concordance for a single exercise (Figures 1, 4, and 6) occurs when the measurements for all samples in the exercise are similarly affected. When just one measurand out of several reported is unusually discordant, the most likely sources of the discordance are calibration errors, calculation errors, or mistaken concentration units. However, the large spike in participant CDC’s retinol in the second exercise of 1998 results from the presence of a plasticizer interference (attributed to the long-term storage of one particular serum pool) in all four of the samples distributed in this exercise. While many participants noted the presence of an unusual peak near that for retinol isomers, in this participant’s system it coeluted with trans-retinol. (We note that distribution of four samples derived from the same serum pool in one interlaboratory exercise was an unusual design. More typical

designs ensure that at least one of the samples has been previously characterized and found not to present terribly atypical measurement challenges; in such designs, the presence of this contaminant Analytical Chemistry, Vol. 72, No. 15, August 1, 2000

3617

Figure 7. Measurement performance summary displays and comparability multiplot for retinol and R-tocopherol measurements characteristic of acceptable short-term control but significant longterm drift. To retain participant confidentiality, only data from 20 sequential interlaboratory exercises are displayed. The horizontal axis of the time-series plot is here modified to display measurements ordered by date but displayed on an ordinal series of span 1-20.

in somesbut not allsof the samples would manifest as more apparent imprecision and less discordance.) Any number of systematic sample handling and chromatographic events can result in atypical discordance in multiple measurands. The minor but unusually negative spikes in all of participant UIC’s measurand concordances at the first exercise of 1993 are attributed to a shipping error: the materials for the exercise were shipped late in the week and arrived, quite defrosted, the following Monday. Comparability Correlation. Correlation among comparabilities (Figure 5, R-tocopherol and total β-carotene, and Figure 9, retinol and R-tocopherol) implies that the given measurands have been influenced by the same, nonstationary source(s) of discordance over a fairly long time period. As noted above, the moderate correlation among participant UIC’s measurands (Figure 5) can be attributed to intentional process changes. We expect strong correlation among measurands that are jointly calibrated, such as the β-carotenes. Many participants also 3618 Analytical Chemistry, Vol. 72, No. 15, August 1, 2000

Figure 8. Measurement performance summary displays for retinol, R-tocopherol, and total β-carotene measurements characteristic of significant short-term measurement imprecision. As in Figure 7, only data from 20 sequential exercises are displayed as an ordinal series.

use the trans-β-carotene calibration for the (chemically very similar) R-carotenes. However, the very strong correlation (Figure 9) between the chemically quite dissimilar retinol and R-tocopherol is unlikely to arise from functionally related calibration errors. It

We have also observed moderately negative correlation among measurand comparabilities. For measurands associated with the same incompletely resolved peak envelope, this plausibly results from systematic changes in peak integration algorithms as relative peak sizes change. However, for one of our chromatographic systems we believe that a period of anticorrelation between the tocopherols and other measurands resulted from suppression of the fluorescence signal of the internal standard used with the tocopherols by the partially coeluting internal standard used for the nonfluorescent measurands.7 The anticorrelation structure disappeared after replacing the independent addition of two internal standard solutions with a single addition of a mixed standard solution of fixed composition. ACKNOWLEDGMENT Figure 9. Comparability multiplot for the retinol, R-tocopherol, and total β-carotene measurements summarized in Figure 8.

is more plausibly related to sample handling or extraction events, although they should result in some degree of correlation among all measurands reported. The absence of correlation between total β-carotene and retinol or R-tocopherol (Figure 9) makes such explanations less plausible. The observed correlation structure (and possibly the apparently poor measurement repeatability) may result from inefficient use of internal standards if, as is common practice, this participant uses one internal standard for the retinol and tocopherol isomers and one or more others for the carotenoids. (7) Sharpless, K. E.; Duewer, D. L. Anal. Chem. 1995, 67, 4416-4422

We thank Dr. James J. Filliben for his suggestions for improving our graphical displays, Dr. Willie E. May for encouraging the evolution and continuing financial support of the M2QAP, and the National Cancer Institute for their support during the program’s first 12 years. We thank all analysts who have participated in the M2QAP exercises and workshops: their participation, enthusiastic support, and critical feedback are crucial to the continued improvement of interlaboratory measurement comparability.

Received for review December 28, 1999. Accepted May 9, 2000. AC991481B

Analytical Chemistry, Vol. 72, No. 15, August 1, 2000
