Anal. Chem. 1991, 63, 1419-1425


Composite Multivariate Quality Control Using a System of Univariate, Bivariate, and Multivariate Quality Control Rules

S. J. Smith,* S. P. Caudill, J. L. Pirkle, and D. L. Ashley

Division of Environmental Health and Laboratory Sciences, Center for Environmental Health and Injury Control, Centers for Disease Control, Public Health Service, U.S. Department of Health and Human Services, Atlanta, Georgia 30333

We propose a composite multivariate quality control (CMQC) system to control simultaneously measured variables. This system is designed to detect unacceptable trends and systematic error in one or more variables, unacceptable random error in one or more variables, and unacceptable changes in the correlation structure in any pair of variables. It is also designed to be tolerant of missing data, to be capable of rejecting as few as one or as many as all variables in a run, and to provide the analyst with control statistics and graphics that logically relate to sources of analytical error. Quality control rules for univariate, multivariate, and correlation conditions are incorporated in the system, as are plots displaying CMQC statistic values and control limits for univariate, multivariate, and correlation parameters. We also discuss advantages of the CMQC over the T² and principal component multivariate quality control methods. We demonstrate the CMQC procedure using data from a laboratory process in which 40 variables were measured during 40 characterization runs and 23 runs analyzing unknowns.

INTRODUCTION

By using modern instrumentation, such as gas chromatography or mass spectrometry, analysts can make simultaneous rapid measurements of multiple analytes per specimen. Normally, a quality control (QC) material and several unknown specimens are grouped into a run and analyzed sequentially. Many analytes are measured simultaneously on each specimen in a run. The QC of such measurements presents analysts with special problems due to (1) the difficulty of interpreting one or more control charts for a large number of variables, (2) the intercorrelations inherent in the simultaneous measurements, and (3) the potential for missing values for some variables in any given run.

Tracking large numbers of variables increases the type I error rate. If the variables are statistically independent, the control limits can be adjusted (cf. Bonferroni or Scheffé), but this adjustment reduces the statistical power of the control procedure. Westgard et al. (1) have proposed a multirule method for more than one level of a single variable that might be applied to the simultaneous control of several independent variables. Applying multiple, independent control chart rules would not result in the correct type I error rate, however, because of the elliptical nature of the "in-control" region. To address this problem, investigators have proposed multivariate approaches, including the T² chart (2-6) and principal components (PCs) (7, 8). Analysts using the T² or PC methods can take advantage of the correlation of the variables and obtain correct overall type I error rates.

Users of the T² and PC methods, however, must reliably estimate the correlation matrix; therefore, the number of characterization runs must substantially exceed the number of control variables. If analysts base the correlation matrix estimate on too few characterization runs while using T² or PC methods, they tend to reject analytical runs when the observed systematic and random errors are small but the interrelationships among variables deviate from those predicted from the characterization runs. These methods also require that all data be observed for all variables in every run; such complete data collection becomes increasingly unlikely as the number of variables increases. In addition, if the PC method is used, analysts may have difficulty relating the components conceptually to the variables from which they were estimated. Therefore, the components may be difficult to interpret. Furthermore, neither method allows one to easily relate individual variable values to the multivariate control statistic.

To address these problems encountered in the multivariate control of large numbers of variables, we propose a composite multivariate quality control system that is designed with the following capabilities: (1) can detect unacceptable trends and systematic error in one or more variables; (2) can detect unacceptable random error in one or more variables; (3) can detect unacceptable changes in the correlation structure in any pair of variables; (4) can tolerate missing data; (5) can reject as few as one or as many as all variables in a run; (6) can provide control statistics that logically relate to sources of analytical error.

The control criteria incorporated in the composite multivariate quality control (CMQC) method include univariate, multivariate, and pairwise correlation control statistics. Action and warning rules are specified for each of these statistics. In addition, we propose a composite univariate QC plot and pairwise correlation plots that allow the user to visually relate individual variable results to multivariate events. We illustrate the CMQC approach by using data from a 40-variable process.

METHODS

Previous multivariate control procedures have been designed to handle the simultaneous control of several related variables, but their use leads to acceptance or rejection of all variables observed in an entire run because one statistic summarizes information for all variables. Thus, these procedures provide no mechanism for rejection of individual variables. This is especially disadvantageous when laboratory resources or samples are limited. Murphy (9) has proposed a system of partitioning T² to select subsets of out-of-control variables; however, his system was not recommended for more than five variables and would be infeasible for large numbers of variables. For example, given a 40-variable process, over 800 significance tests are required when using Murphy's method. Therefore, we propose an alternative system that allows analysts to partially reject a run (i.e., as few as one or as many as all variables may be rejected). Because analysts using this approach can identify problems with individual variables, they can take more timely and specific corrective action.

* Corresponding author.

This article not subject to U.S. Copyright. Published 1991 by the American Chemical Society


ANALYTICAL CHEMISTRY, VOL. 63, NO. 14, JULY 15, 1991

Table I. Composite Quality Control Rules

Univariate Rules: Reject Measurement(s) for a Single Analyte in a Run
  Action Rules
    1. Extreme deviation: Reject any single variable outside 3 standard deviations during one run.
    2. Moderate deviation: Reject any single variable outside 2 standard deviations for two consecutive runs (same side of mean).
    3. Trend 1: Reject any single variable displaying a consistent trend up or down for seven consecutive runs.
    4. Trend 2: Reject any single variable falling on the same side of the mean for 10 consecutive runs.

Multivariate Rules: Reject Measurements for All Analytes in a Run
  Control statistic: multivariate control statistic (MCS) = log(sum of squared deviates)/n
  Action Rules
    1. Extreme deviation: Reject a single run outside the 0.003 probability level.
    2. Moderate deviation: Reject two consecutive runs outside the 0.05 probability level.
    3. Trend 1: Reject if the statistic increases in value for seven consecutive runs.
    4. Trend 2: Reject if the statistic falls above the median for 10 consecutive runs.
  Warning Rule
    1. Reject a single run outside the 0.05 level.

Correlation Rules: Reject Measurements for a Pair of Analytes in a Run
  Control statistic: The probability (PCORR) of the observed results for a pair of analytes is calculated on the basis of the elliptical confidence region for pairwise correlation of the two analytes, which is obtained from the characterization runs.
  Action Rules
    1. Extreme deviation: Reject any pair of analytes with PCORR values less than 0.003 on at least two consecutive runs.
    2. Trend: Reject any pair of analytes with PCORR values that decrease for seven consecutive runs.
  Warning Rules
    1. Moderate deviation: Reject any pair of analytes with PCORR values less than 0.05 during one run.
    2. Trend: Reject any pair of analytes with PCORR values less than 0.5 for 10 consecutive runs.
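The four univariate action rules in Table I can be read as simple checks on one variable's history of standardized deviates. The Python sketch below is illustrative only: the function name, the use of None for a missing result, the choice to let a missing result break a run of consecutive values, and the reading of "seven consecutive runs" as seven strictly increasing or decreasing values are all assumptions, not part of the published rules.

```python
from typing import List, Optional, Sequence


def univariate_violations(deviates: Sequence[Optional[float]]) -> List[str]:
    """Return the Table I univariate action rules violated at the latest run.

    `deviates` holds the standardized deviates of one variable, oldest first.
    Assumption: a missing value (None) breaks any run of consecutive results.
    """
    violated: List[str] = []
    if not deviates or deviates[-1] is None:
        return violated
    last = deviates[-1]

    # Rule 1: a single result outside 3 standard deviations.
    if abs(last) > 3:
        violated.append("extreme deviation")

    # Rule 2: two consecutive results outside 2 SD, same side of the mean.
    tail2 = list(deviates[-2:])
    if len(tail2) == 2 and None not in tail2:
        if all(d > 2 for d in tail2) or all(d < -2 for d in tail2):
            violated.append("moderate deviation")

    # Rule 3: consistent trend up or down over seven consecutive runs
    # (assumed to mean seven strictly monotone values, i.e. six steps).
    tail7 = list(deviates[-7:])
    if len(tail7) == 7 and None not in tail7:
        steps = [b - a for a, b in zip(tail7, tail7[1:])]
        if all(s > 0 for s in steps) or all(s < 0 for s in steps):
            violated.append("trend 1")

    # Rule 4: ten consecutive results on the same side of the mean.
    tail10 = list(deviates[-10:])
    if len(tail10) == 10 and None not in tail10:
        if all(d > 0 for d in tail10) or all(d < 0 for d in tail10):
            violated.append("trend 2")

    return violated
```

Calling the function once per variable after each run gives the list of rules, if any, under which that variable's measurement would be withheld for the run.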

A summary of the CMQC control rules (Table I) shows the univariate, multivariate, and pairwise correlation warning and action rules. The univariate rules operate on single variables to detect systematic or random errors. The multivariate rules detect systematic and random errors in all of the variables considered together. The pairwise correlation rules supplement the multivariate control rules by providing a test for unacceptable changes in correlation between any two variables.

The process variation for a given analytical system is estimated from a specified number of characterization runs. In the Procedures section, we discuss the number of runs required to characterize an in-control multivariate analytical system; in short, the number depends on the number of variables being simultaneously monitored.

Analytical methods used in chemical assays commonly employ standard calibrators, yielding response curves from which concentrations of unknowns are estimated in a series of independent runs. Typically, these methods exhibit among-run variation that greatly exceeds replication error within runs. With these assays, one should use the standard deviation of the run mean concentration for the process variation. This is the approach we use, though the CMQC method is applicable to other methods for computing process variation.

Univariate Control Rules. The proposed CMQC univariate control rules are similar to those commonly used in laboratory quality control. That is, they incorporate two and

Figure 1. Pairwise correlation plot. An example plot of observed standardized deviates is shown for a pair of variables along with the 95% confidence ellipse determined from the characterization correlation estimate for these two variables. (Axes run from -3 to 3 standardized units; the horizontal axis is labeled VARIABLE 1.)

three standard deviation action limits and tests for runs (Table I). These rules are applied to each variable independently. If any of the four univariate rules is violated for a single variable, the measurements for that variable are not reported for that run.

In the univariate case, the values of the control variable measured in the run's QC sample are compared with control limits estimated from preliminary characterization runs to determine if a run is in control at a given probability level. The successive run values are plotted for visual interpretation of trends and shifts in the results. We found that the most useful way to plot control charts is to use standardized deviates. The standardized deviate for each run equals the difference between the control variable result and the overall mean of the control variable in the characterization runs, divided by the standard deviation of the control variable in the characterization runs. The control chart is then constructed by using reference lines at ±2 and ±3 standard deviations from the mean.

Multivariate Control Rules. We define a new multivariate control statistic (MCS) given by

\[
\mathrm{MCS} =
\begin{cases}
\dfrac{\log\left(\sum_i d_i^2\right)}{n}, & K = 0 \\[1ex]
\dfrac{\log\left(\sum_i d_i^2 + (K-1)/K\right)}{n}, & 0 < K < n
\end{cases}
\]

where d_i is the standardized deviate for the ith variable, n is the number of control variables measured in a single run, and K is the number of variables with missing data. The distribution of this statistic is known only for the case of uncorrelated variables, where it is a simple function of χ². For correlated variables, the distribution is unknown. We used the log transformation of the sum of squares to induce symmetry in an otherwise right-skewed statistic. The divisor (n) standardizes the statistic for the number of deviates, and we used the term (K - 1)/K to adjust the statistic for missing measurements. The term K - 1 is equal to the expected value of the sum of the squared standardized deviates for the K missing results.
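As a concrete reading of the MCS definition above, the following Python sketch computes the statistic for one run. Several details are assumptions made for illustration and are not stated in the text: the function name, the use of None to mark a missing deviate, the natural logarithm as the base of "log", and the interpretation of n as the total number of control variables (observed plus missing), which the condition 0 < K < n suggests.

```python
import math
from typing import Optional, Sequence


def multivariate_control_statistic(deviates: Sequence[Optional[float]]) -> float:
    """MCS for one run; None marks a variable with missing data.

    Assumptions: natural log, and n counts all control variables
    (observed plus missing).
    """
    n = len(deviates)
    k = sum(d is None for d in deviates)  # K, the number of missing variables
    if n == 0 or k >= n:
        raise ValueError("at least one observed deviate is required")

    # Sum of squared standardized deviates over the observed variables.
    sum_sq = sum(d * d for d in deviates if d is not None)

    # Adjustment term (K - 1)/K, applied only when results are missing.
    if k > 0:
        sum_sq += (k - 1) / k

    if sum_sq <= 0:
        raise ValueError("log is undefined for a zero sum of squares")
    return math.log(sum_sq) / n
```

For a complete run the second branch is never taken, so the function reduces to the log of the sum of squared deviates divided by n, as in Table I.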
To estimate critical values for the statistic, we use Monte Carlo simulation.

Pairwise Correlation Rules. We based the pairwise correlation rules on elliptical confidence regions (10) determined from correlation estimates from the characterization runs for each pair of standardized deviates (Figure 1). These ellipses are centered at the point (0,0), corresponding to zero deviation from the characterization means for both variables in the pair. The orientation of each ellipse is determined by the degree of correlation between deviate pairs. The length and width of each ellipse are a function of the number of characterization runs used to compute the correlation estimate.
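One simple way to attach a probability to a pair of standardized deviates, given such an ellipse, is sketched below in Python. It is a simplification rather than the calculation used in this work: it assumes a bivariate normal model with zero means, unit variances, and a correlation r treated as known, so the dependence of the ellipse on the number of characterization runs is ignored. The function name is hypothetical.

```python
import math


def pair_probability(d1: float, d2: float, r: float) -> float:
    """Tail probability of a pair of standardized deviates.

    Assumes a bivariate normal model with zero means, unit variances and a
    known correlation r; no small-sample correction for the number of
    characterization runs is applied.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    # Squared Mahalanobis distance of (d1, d2) from the center (0, 0).
    q = (d1 * d1 - 2.0 * r * d1 * d2 + d2 * d2) / (1.0 - r * r)
    # Under the model, q is chi-square with 2 degrees of freedom,
    # whose survival function is exp(-q / 2).
    return math.exp(-q / 2.0)
```

Under these assumptions, with r = 0.9 a pair such as (2, 2) lies along the long axis of the ellipse and receives a probability above 0.05, whereas (2, -2) lies across it and receives a probability far below 0.003, even though each deviate alone is only 2 standard deviations from its mean.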


We computed probability (PCORR) values for each pair of deviates for each postcharacterization analytical run on the basis of its distance from and orientation to the (0,0) center of its corresponding characterization ellipse. Each PCORR value for a pair of analytes should be interpreted as the probability that the observed results for that pair are due to the random variation expected on the basis of the estimated correlation between the two analytes. For example, a PCORR value < 0.003 would indicate a pair of results rarely expected for an analyte pair on the basis of the correlation estimates obtained from the characterization runs. Thus, a pair of results with a PCORR value < 0.003 would be analogous to a single result outside a 3 standard deviation control limit on a regular Shewhart quality control chart.

We recommend that analysts plot 1 - PCORR values for the current and all previous runs for each pair of variables whose PCORR value on the current run is less than 0.003 or whose PCORR values on the current and most recent runs are both less than 0.05. Analysts should also produce plots when 10 consecutive PCORR values are all less than 0.5 or when they observe 7 consecutive decreasing PCORR values for one pair of analytes.

PROCEDURES

Step 1. Specification of Conditions for QC Analysis. The application of CMQC to a specific problem involves a number of preliminary considerations. First, analysts must decide what variables will be monitored, how many runs will be used to characterize the process, and how missing values and incomplete characterization runs will be handled.

Potential variables to be monitored include all outcome variables and any other parameters related to these variables. For example, in mass spectrometry, a parameter such as an ion ratio could be included as a control variable. Including such additional parameters, however, can produce an unreasonable number of control variables. Often, separate criteria such as tolerance limits are adequate for monitoring these parameters.

To determine the number of runs needed to properly characterize the process, we had to consider both statistical error and resource constraints. Using Monte Carlo simulation, we investigated error rates for QC means charts as a function of the number of characterization runs used to estimate the variance of the process. We determined the expected probabilities of false rejects and false accepts for 3σ limits (QC means chart) for 10-50 characterization runs. We found that using 10 characterization runs would result in a false rejection rate about 50% larger than expected if the mean and variance of the process were known. We found that using 40 characterization runs would result in only a 20% increase. Thus, by taking 40 characterization runs rather than 10 characterization runs, one can decrease the false acceptance rate by about 40%. These simulations were done for a univariate process; as the number of measured variables in a multivariate process increases, the chances of observing maverick data for one or more variables in the characterization runs increase.

We also simulated a pairwise correlation-based QC process with 0.997 probability level control limits. We found that using 10 (or even 20) characterization runs yielded false reject rates that were 27 (or 13) times the rate expected if the pairwise correlation was known. Thirty characterization runs would result in almost a 10-fold increase in the false reject rate, and 40 characterization runs would result in a 6-7-fold increase. Increasing from 40 to 50 or more characterization runs had little effect on reducing the false reject rate. The false accept rate with 30 and 40 characterization runs is about 1% and 0.8%, respectively. Thus, the actual number of runs required to adequately characterize a multivariate process will depend upon the impact of the error rates on the control of the process and any constraints imposed by available resources (e.g., cost or time to generate a run, availability of control material, etc.).

The sample sizes should be chosen to protect against having too few complete characterization runs (i.e., runs with observations for all variables) and also to protect against unacceptably high false reject and false accept rates. By using T² or PC, one would be restricted to analyzing only those runs of unknowns that include measurements of all variables, whereas by using the CMQC approach one can accommodate runs from which some measurements are missing. The analyst should determine the criteria for excluding runs with an unacceptable number of missing results. For example, one may wish to specify that at least three-fourths of all variables be observed in each reported run and in each characterization run or that certain key variables always be observed in reported runs.

Step 2. Characterization Runs: Determining Estimates of the Quality Control Parameters. Once a set of characterization runs has been generated, decisions must be made about their adequacy and quality. Initially, a target mean and standard deviation are obtained for each control variable on all data. Then each variable is standardized separately by subtracting its respective mean and dividing by its respective standard deviation, yielding the standardized deviate.

After standardizing the variables, we recommend that the characterization data be plotted for visual inspection. An individual univariate plot for each variable should be generated. We also recommend the use of a composite univariate plot that allows the easy detection of multivariable trends. In the composite plot (Figure 2; data are discussed in the illustration section), each connected line corresponds to the standardized measurements for a single variable. The horizontal lines correspond to the means of the variables (a value of zero in standardized space). These lines are offset by 6 units to indicate that a point exactly between two lines is 3 standard deviations from its mean.
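The geometry of the composite univariate plot can be made explicit with a short Python sketch that only computes the vertical plotting positions; drawing the connected lines is left to whatever plotting tool is at hand. The function name and the use of None for a missing value are assumptions. Drawing a missing value on the variable's baseline follows the treatment described for Figure 2b in the illustration.

```python
from typing import List, Optional, Sequence

OFFSET = 6.0  # spacing between baselines, in standard deviation units


def composite_plot_rows(
    deviates_by_variable: Sequence[Sequence[Optional[float]]],
) -> List[List[float]]:
    """Vertical plotting positions for a composite univariate plot.

    Row i holds the standardized deviates of variable i across runs.
    Assumption: a missing value (None) is drawn on the variable's baseline.
    """
    rows: List[List[float]] = []
    for i, deviates in enumerate(deviates_by_variable):
        baseline = i * OFFSET  # horizontal line for variable i (deviate of zero)
        rows.append([baseline + (0.0 if d is None else d) for d in deviates])
    return rows
```

With this spacing, a deviate of +3 on one variable and a deviate of -3 on the variable plotted above it land at the same height, exactly halfway between the two baselines, which is what makes the 3 standard deviation level easy to read off the plot.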
These plots are useful initially in detecting statistically based extreme outliers (e.g., points more than 3.5 standard deviations from the mean), which should be deleted from the characterization data. Variable-by-run matrix displays similar to those in Figures 3 and 4 are useful for summarizing data on the moderate deviation univariate rules. Next, multivariate control statistic values should be computed and plotted for each of the characterization runs (see Figure 5). By using the composite univariate plot and the MCS plot, checks are made to see if any variables or groups of variables are outside expected limits. Necessary corrections are made and characterization values recomputed.

To evaluate the correlation structure of the characterization data, correlation coefficients should be computed for each pair of standardized deviates. These correlation estimates are then displayed in a histogram as illustrated in Figure 6 or in a matrix format (Figure 7). If any estimate seems to deviate from the majority, the corresponding variable pair is investigated to determine whether the estimate is consistent with the known relationship between the two variables. If errors are detected in the analytical process or data generation procedures, the necessary corrections are made in the characterization data, and new means, standard deviations, standardized deviates, and pairwise correlation coefficients are computed. Correlation QC plots similar to the one shown in Figure 8 are then generated for each pair of variables that violates any of the warning or action rules for correlation, and an analysis of the characterization runs is performed by using each of the univariate, multivariate, and pairwise correlation warning and action rules described in Table I.

Step 3. Quality Control Analysis of Unknown Samples. Following the review of the data on a univariate, multivariate, and correlation basis, the working set of characterization values is finalized. These data are used to generate estimates of means, standard deviations, standardized deviates, and correlation coefficients for the analysis of future test runs.

Figure 2. (a, left) Composite univariate plot before statistical outliers in characterization runs were omitted. This composite plot shows the standardized deviates observed by run for each of the 40 volatile compounds before statistical outliers were omitted. The horizontal lines represent the expected zero deviate for each compound. These lines are offset by 6 units, indicating that a point exactly between 2 lines is 3 standard deviations from its respective mean. Each connected line corresponds to the standardized measurements for a single compound. Run numbers with the "C" suffix are characterization runs. Run numbers with the "U" suffix are runs in which unknowns were analyzed. The run number consists of the year "89", followed by the Julian date of the run. (b, right) Composite univariate plot after statistical outliers in characterization runs were omitted. A composite plot similar to a is shown. This plot was generated after a 3.5 standard deviation outlier rule had been applied to the observed characterization measurements for each compound.
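The Step 2 estimates described above (means, standard deviations, standardized deviates, outlier deletion, and pairwise correlations) can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the procedure as implemented by the authors: the function name is hypothetical, missing results are coded as NaN, a single outlier pass is made, sample standard deviations (ddof=1) are used, and each correlation is computed from the runs in which both variables of the pair were observed.

```python
import numpy as np


def characterize(data, outlier_sd: float = 3.5):
    """Estimate QC parameters from characterization runs.

    data: runs x variables array, with np.nan marking missing results.
    Returns (means, sds, deviates, corr). Assumptions: one outlier pass,
    sample standard deviations (ddof=1), pairwise-complete correlations.
    """
    data = np.array(data, dtype=float)  # copy, so the caller's array is untouched

    # First pass: provisional means, standard deviations and deviates.
    means = np.nanmean(data, axis=0)
    sds = np.nanstd(data, axis=0, ddof=1)
    deviates = (data - means) / sds

    # Delete extreme outliers, then recompute the estimates without them.
    data[np.abs(deviates) > outlier_sd] = np.nan
    means = np.nanmean(data, axis=0)
    sds = np.nanstd(data, axis=0, ddof=1)
    deviates = (data - means) / sds

    # Pairwise correlations from runs where both variables were observed.
    m = data.shape[1]
    corr = np.eye(m)
    for i in range(m):
        for j in range(i + 1, m):
            both = ~np.isnan(deviates[:, i]) & ~np.isnan(deviates[:, j])
            if both.sum() > 2:
                r = np.corrcoef(deviates[both, i], deviates[both, j])[0, 1]
            else:
                r = np.nan  # too few paired results to estimate a correlation
            corr[i, j] = corr[j, i] = r
    return means, sds, deviates, corr
```

Deviates for later runs of unknowns would then be formed with the returned means and standard deviations, so that the characterization runs alone define the control limits.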

ILLUSTRATION

The CMQC method proposed in the preceding section is being applied to a multivariate quality control problem that arose from the Centers for Disease Control's (CDC's) involvement in the Third National Health and Nutrition Examination Survey (NHANES III). The Division of Environmental Health Laboratory Sciences (EHLS) in the Center for Environmental Health and Injury Control (CEHIC) is measuring 40 volatile compounds in specimens to be collected from about 1000 people. Mass spectrometry is used to measure these 40 compounds. Because of the complex analytical method and the need for a large amount of data reduction by computer, about 2 h is required to process each specimen. The limited volume of whole blood available per unknown sample means that a specimen cannot be reprocessed easily or economically.

Step 1. Specification of Conditions for QC Analysis. We had two sets of 40 runs available to characterize the volatiles analytical method. The first set was run with QC materials at a higher concentration, and the second set was run with materials at a lower concentration. The calibrators for each set consisted of six different preparations. Each calibrator preparation contained from 5 to 9 different compounds, for a total of 40 identical compounds per set. For purposes of illustration, we will discuss the results for the quality control runs in which the high-concentration materials were used. We also will show results for 23 postcharacterization runs.

Step 2. Characterization Runs: Determining Estimates of the Quality Control Parameters. The figures used to illustrate our analysis of the characterization data also include postcharacterization runs that we will discuss in the next section. A composite univariate chart for all runs is shown in Figure 2. In constructing this chart, we applied a preliminary 3.5 standard deviation outlier rule to the characterization data for each compound in order to remove extreme observations. Using this rule, we removed 7 of 1600 potential observations. Though these points are included in the composite plot in Figure 2a, they and any missing observations appear as zero deviations on the respective horizontal lines of the composite plot in Figure 2b. We elucidated the relationship among the compounds included in the same quality control preparation by positioning them together in six different groups on one composite univariate plot.

Data are missing for compounds 19 and 20 after run 89332C in the composite plots (Figure 2). This is because, beginning in run 89335C, the analytical standard was changed for these two compounds. The composite plots indicate that some compounds that were included in the same preparations are highly correlated. For example, the plotted points for compounds 19, 20, 26-28, and 31-33 seem to be almost parallel. These relatively high correlations can be found in Figure 7, where we see a few correlation coefficients approach unity (cf. the point for compounds 19 and 20). The values of the multivariate control statistic for characterization and unknowns runs that correspond to Figure 2b

Figure 8. Quality control plot of pairwise correlations. The correlation-based probability values for the standardized deviates of compounds 16 and 38 are shown for 40 characterization runs and 23 runs of unknowns. These probability values represent one minus the probability that the observed pair of deviates behave as expected on the basis of their observed correlation during characterization runs. For example, a probability of 0.99 on the plot indicates that the relationship between the pair of compounds departs significantly in the current run from that observed during characterization. Horizontal lines represent 0.50 and 0.95 probability level control limits. The digit "1" in the plot represents a single out-of-control event. The digit "2" represents out-of-control events in two consecutive runs.

the correlation action rules (Table I) were violated at least once during the 40 characterization runs. The action rules were not violated for any pairs of compounds during the characterization runs. The probability values for compounds 26 and 27 (runs 89150C-89230C) are, with the exception of run 89174C, all below 0.5 (Figure 8). Figure 2b shows that the deviations of compounds 26 and 27 from their respective means were very small during these early characterization runs. Thus, the correlation plots and composite plots used together provide us with useful diagnostic information for the analysis of characterization data.

Step 3. Quality Control Analysis of Unknown Samples. Using the means, standard deviations, standardized deviates, and pairwise correlation estimates obtained from the 40 characterization runs, we next illustrate the CMQC method by analyzing the 23 postcharacterization runs indicated by a U suffix on the run number (Figures 2-8). The composite univariate plots in Figure 2b show that several compounds had results near 3 standard deviations above their respective means. For instance, 5 of the 10 compounds in the LE preparation group (i.e., compounds 26-35) had extreme results on run 90010U. The values for the multivariate control statistic for the 23 runs analyzing unknowns indicate that runs 90010U, 90052U, and 90079U exceed the 95% probability level control limit and that run 90089U exceeds the 99% control limit. Therefore, run 90089U was rejected on the basis of the extreme deviation rule. In addition, runs 90082U and 90067U were rejected on the basis of the moderate deviation rule.

Applying the univariate rules, we rejected one measurement for compound 39 on run 90003U because it exceeded the limits of the extreme deviation rule, and we rejected one additional measurement for compound 26 on run 90010U because it exceeded the limits imposed by the moderate deviation rule.

The extreme deviation action rule for the correlation was violated for the pair of compounds 26 and 27 during consecutive unknowns runs 90005U and 90010U and also during consecutive unknowns runs 90036U and 90044U. These results, presented in Figure 8, also indicate that the warning rules (any pair of deviates with probability values less than 0.05) were violated twice during the 23 runs of unknowns for these 2 compounds. Figure 7 shows that compounds 26 and 27 have a high positive correlation. This positive correlation is also reflected in the composite plot in Figure 2b until unknown run 90005U, at which point compound 26 has an extreme positive shift away from its mean for 2 runs, whereas compound 27 deviates only slightly from its mean during the same 2 runs. The deviation in compound 26 was sufficiently large on these two consecutive runs (i.e., 90005U and 90010U) and on two additional consecutive runs (i.e., 90036U and 90044U) to trigger the extreme deviation correlation action rule.

DISCUSSION

The design of a composite system for the simultaneous control of many variables calls for the consideration of a number of issues, some of which are common to univariate QC, some of which are unique to multivariate QC, and others of which relate to problems encountered as the number of control variables increases. The issues common to univariate QC include (1) specification of the size of type I error rates and power of the control rules, (2) determination of the minimal amount of data required for reliable estimates of characterization parameters, and (3) presentation of useful graphical interpretation of the results. The issues unique to multivariate QC include (1) accounting for the correlation structure among the variables and (2) obtaining reliable estimates of multivariate test statistics as a function of the number of variables measured and the correlation structure. Additional considerations addressed by the CMQC method proposed in this report include (1) handling missing observations among different subsets of variables during characterization and unknowns runs, (2) specifying control rules that allow users to accept a run with missing variables, and (3) relating the observed values of multivariate test statistics to the results on the individual variables.

If the number of monitored variables (M) is small (M < 5), simple techniques such as Bonferroni adjustments to control limits or the implementation of binomial rules may be useful.
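The trade-off behind a Bonferroni adjustment can be illustrated with a small Python calculation. It rests on assumptions that the text itself flags as unrealistic for correlated analytes: the M variables are treated as statistically independent and normally distributed, and the per-variable false-alarm rate is taken from two-sided limits at a given number of standard deviations. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_alpha(limit_sd: float) -> float:
    """False-alarm rate of one normal variable with limits at +/- limit_sd."""
    return math.erfc(limit_sd / math.sqrt(2.0))


def overall_false_alarm(alpha: float, m: int) -> float:
    """Chance that at least one of m independent in-control variables
    falls outside limits with per-variable false-alarm rate alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_limit_sd(overall: float, m: int) -> float:
    """Two-sided limit, in SD units, after dividing the overall rate by m."""
    per_variable = overall / m
    return NormalDist().inv_cdf(1.0 - per_variable / 2.0)


alpha_3sd = two_sided_alpha(3.0)                    # about 0.0027 per variable
rate_40 = overall_false_alarm(alpha_3sd, 40)        # unadjusted, 40 variables
widened = bonferroni_limit_sd(alpha_3sd, 40)        # adjusted limit in SD units
```

Under these assumptions, unadjusted 3 standard deviation limits on 40 variables give an overall false-alarm rate of roughly 0.10 per run, and restoring the single-variable rate requires limits close to 4 standard deviations, which is where the loss of power comes from.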
These methods, however, do not lead to correct type I error rates and suffer from a loss of power as M increases. Although the type I error rates associated with T² (Jackson, Hotelling) and principal components (Jackson, Fisher) are correct for the multivariate case, several problems occur when analysts apply T² or PC to the simultaneous control of many variables.

First, T² and PC are applicable only if all variables are observed in each run of unknowns. As the number of measured variables increases, the likelihood of missing data increases rapidly and, for some processes, probably approaches unity. As the demand rises for state-of-the-art measurements that are near detection or quantitation limits, missing data are almost certain in low-concentration QC samples (e.g., only 15 of 40 runs had all measurements in the characterization data presented in the previous section). This missing-data problem makes it difficult to obtain a sufficient number of characterization runs to make precise estimates of the correlation matrix.

Second, analysts using the T² and PC procedures reject or accept all variables in entire runs. Although this situation may be tolerable in low-cost experiments when analysts have adequate backup samples and adequate time to reanalyze them, the use of these procedures is not even feasible in many situations. Even when it is feasible, discarding entire runs when only a fraction of the measured variables are invalid causes a duplication of effort and results in the loss of data on otherwise acceptable variables.

Third, analysts have difficulty relating the outcomes of the T² and PC multivariate test statistics to the results observed on the individual control variables. The ability to easily connect the multivariate control statistic pattern to individual variables is very important to analysts because they will not want to repeat runs without adequate justification. They also need to know which variables require attention in order to bring the process under control.

The proposed CMQC method overcomes the three problems pointed out for the T² and PC methods. First, it does not require all the variables in a run to be observed. Although the user may set an arbitrary limit on the minimum number of observed variables for a run to be considered useful, the control rules do not impose any limits on missing data. The multivariate control statistic incorporates a correction for missing data, and therefore its critical values do not require adjustment. Second, the CMQC method allows users, by independently applying univariate and pairwise control rules, to reject a fraction of the observed variables without discarding the entire run. The rationale for using these rules is that maverick events may affect only a fraction of the measured variables. As we discussed above, we could not place all materials in a single QC preparation; five different QC preparations were required. Thus, using our system, analysts could reject any similar maverick measurements associated with the same QC preparation as a group. Third, the CMQC incorporates several unique graphical tools to aid the analyst in interpreting highly dimensional multivariate data. These tools include the composite univariate plot, which allows the user to quickly spot trends, parallelism, and maverick events for all variables. The pairwise correlation control charts are valuable for identifying specific shifts in correlation that could not be detected with only a single-parameter control system like T². In addition, the matrix plot is useful for identifying pairs of variables whose correlations are unusual.

In summary, the proposed multivariate control procedure solves several common problems with both univariate and multivariate QC. It also deals with a number of issues that are encountered as the number of control variables increases. Certain aspects of CMQC are not addressed here. These include (1) the determination of the distribution and critical values for the multivariate statistic as a function of the number of QC variables and the correlation structure; (2) the definition and determination of the multivariate statistical power of the method as a function of the number of QC variables, the correlation structure, and the number of missing variables; and (3) the robustness of the method to missing observations during characterization or analytical runs. Also, a comparison of the MCS statistic results with those of the T² or PC methods would be desirable. The results of simulation analyses on these issues will be presented in a separate paper.

LITERATURE CITED

(1) Westgard, J. O.; Barry, P. L.; Hunt, M. R. Clin. Chem. 1981, 27, 493-501.
(2) Jackson, J. E. Ind. Quality Control 1958, 12 (7), 4-8.
(3) Hotelling, H. Multivariate Quality Control, Techniques of Statistical Analysis; McGraw-Hill: New York, 1947.
(4) Hotelling, H. Proceedings of the Second Berkeley Symposium in Mathematical Statistics and Probability; University of California Press: Berkeley, 1950; pp 23-41.
(5) Jackson, J. E.; Morris, H. J. Am. Stat. Assoc. 1957, 52, 186-199.
(6) Jackson, J. E. Technometrics 1959, 1, 359-377.
(7) Jackson, J. E.; Mudholkar, G. S. Technometrics 1979, 21, 341-349.
(8) Fisher, M. T.; Lee, J.; Mare, M. K. Analyst 1988, 3, 1225-1229.
(9) Murphy, B. J. The Statistician 1987, 36, 571-583.
(10) Morrison, D. F. Multivariate Statistical Methods; McGraw-Hill: New York, 1967; pp 120-121.

RECEIVED for review October 19, 1990. Accepted April 18, 1991.

Anal. Chem. 1991, 63, 1425-1432

Interactive Self-Modeling Mixture Analysis
Willem Windig* and Jean Guilment
Eastman Kodak Company, Rochester, New York 14652-3712

In the analytical environment, spectral data resulting from the analysis of samples often represent mixtures. To extract information about pure components often is a major problem, especially when reference spectra are not available. For this type of problem, self-modeling mixture analysis techniques have been developed. Although successful commercial applications have been developed, the application of these techniques to complex data sets requires skilled operators. Furthermore, no general purpose software is available. In order to make self-modeling mixture analysis more accessible, a new method has been developed. For the approach described here, all the intermediate steps can be presented straightforwardly in the form of spectra, and it is possible to direct the procedure by using chemical knowledge of the samples. Examples will be shown of Raman spectroscopic data of a reaction, where spectra of intermediates are extracted, and of FT-IR microscopic data of a polymer laminate, where it will be shown that spectra of layers below the resolution of the FT-IR microscope can be calculated.

0003-2700/91/0363-1425$02.50/0 © 1991 American Chemical Society

INTRODUCTION

Despite the use of hyphenated and/or high-resolution analytical instruments, the resulting spectral data often represent mixtures of several components. Furthermore, reference spectra are not always available to resolve the mixture data by techniques such as least squares or spectral subtraction. For this type of problem, self-modeling mixture analysis techniques have been developed. Generally, the term curve resolution is used for approaches like the one discussed in this paper. We would like to limit the use of the term curve resolution to cases where the continuous character of the concentration profile, such as in the data that result from hyphenated instruments, is used in the algorithm, and to use the term self-modeling mixture analysis for algorithms where no such assumptions about the concentration profiles are used. For an excellent review of factor analysis based mixture analysis, see the recent review of Gemperline and Hamilton (1, 2). For a more geometrically oriented explanation, see ref 3. Most of these self-modeling techniques are based on principal component analysis. Although principal component analysis is currently the state-of-the-art approach for self-modeling curve resolution, its use is mainly limited to the resolution of data of hyphenated techniques, e.g., Hewlett-Packard's Quickres (Infometrix, Inc., 2200 Sixth Ave, Suite 833, Seattle, WA 98121) (4), which works with diode-array chromatographic detectors. Quickres is limited to resolving two components, however. Another curve resolution related program is Beckman's peak purity program (Beckman Instruments, Inc., Altex