Anal. Chem. 2010, 82, 5972–5982
Self-Organizing Map Quality Control Index

Sila Kittiwachana,† Diana L. S. Ferreira,† Louise A. Fido,‡ Duncan R. Thompson,‡ Richard E. A. Escott,§ and Richard G. Brereton*,†

Centre for Chemometrics, School of Chemistry, University of Bristol, Cantocks Close, Bristol BS8 1TS, U.K., GlaxoSmithKline, Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, U.K., and GlaxoSmithKline, Old Powder Mills, Tonbridge, Kent TN11 9AN, U.K.

A new approach for process monitoring is described, the self-organizing map quality control (SOMQC) index. The basis of the method is that SOM maps are formed from normal operating condition (NOC) samples using a leave-one-out approach. The distances (or dissimilarities) of the left-out sample to all the units in the map can be determined, and the nth percentile of these distances is used to provide a null distribution of NOC distances, generated using the Hodges-Lehmann method. The nth percentile distance of a test sample to a map generated from all NOC samples can then be measured and compared to the null distribution at a given confidence level to determine whether the sample should be judged as out of control. The approach described in this paper is applied to online high-performance liquid chromatography (HPLC) measurements of a continuous pharmaceutical process and is compared to other established methods including the Q and D statistics and support vector domain description. The SOMQC has advantages in that there is no requirement for multinormality in the NOC samples, for linear models, or for principal components analysis (PCA) prior to the analysis, with its concomitant issues about choosing the number of PCs. It also provides information about which variables are important using component planes. The influence of extreme values in the background data set can also be tuned by choosing the distance percentile.

There has been significant interest in the application of multivariate methods to process monitoring, often called multivariate statistical process control (MSPC), over the past 2 decades.1-3 Most classical approaches have been applied for monitoring using spectroscopy, especially near-infrared (NIR). Spectra are recorded regularly, and any change in the expected spectrum suggests that there could be a problem with a process, such as impurities or products of unacceptable quality. Multivariate methods have been slow to gain acceptance in industry but are gradually being incorporated, especially with the process analytical technology (PAT) initiative.4

* To whom correspondence should be addressed.
† University of Bristol.
‡ GlaxoSmithKline, Gunnels Wood Road.
§ GlaxoSmithKline, Old Powder Mills.
(1) Nomikos, P.; MacGregor, J. F. Technometrics 1995, 37, 41–59.
(2) Kourti, T.; MacGregor, J. F. Chemom. Intell. Lab. Syst. 1995, 28, 3–21.
(3) Westerhuis, J. A.; Gurden, S. P.; Smilde, A. K. Anal. Chem. 2000, 72, 5322–5330.
Most approaches currently available are based on statistical principles first proposed in the 1970s and 1980s. Concepts such as the D or T2 statistic1-3,5-7 and the Q statistic, or squared prediction error (SPE),1-3,5-7 were first reported in that era and have been gradually refined. Many methods rely on determining a normal operating conditions (NOC) region and assessing whether future test samples deviate from it. There is some debate as to how to determine which samples are part of the NOC region, with no universally agreed criterion; it could be defined, for example, by the operator or, in a regulated industry, by legal requirements. The samples to be included in the NOC could even be established using consumer tests. A weakness of most existing approaches is that they are based on linear least-squares multivariate statistics and assume that the data are multinormal. However, these three assumptions of normality, linearity, and minimization of least-squares residuals (often in the presence of outliers, asymmetry, or heteroscedastic noise) are not necessarily requirements for defining an NOC region. In order to consider a wider range of methods, process monitoring can be formulated as a one-class classification problem8-11 in which the NOC samples are modeled as a group, and the aim is to determine how well a test sample, that is, a sample obtained in the future, fits into the NOC class model at a given confidence level. The D statistic can be considered as a one-class classifier using a model based on quadratic discriminant analysis (QDA) and disjoint principal component (PC) projections (that is, principal components analysis (PCA) is performed only on the NOC region). However, there are many other recent methods for pattern recognition, such as support vector domain description (SVDD)7-10,12 and self-organizing maps (SOMs),13-16 that are increasingly common in machine learning.
(4) Workman, J.; Koch, M.; Veltkamp, D. Anal. Chem. 2007, 79, 4345–4364.
(5) Qin, S. J. J. Chemom. 2003, 17, 480–502.
(6) Jackson, J. E.; Mudholkar, G. S. Technometrics 1979, 21, 341–349.
(7) Brereton, R. G. Chemometrics for Pattern Recognition; Wiley: Chichester, U.K., 2009.
(8) Tax, D. M. J. One-Class Classification; Concept-Learning in the Absence of Counter-Examples. Ph.D. Thesis, Delft University of Technology (NL), 2001, http://ict.ewi.tudelft.nl/∼davidt/papers/thesis.pdf.
(9) Tax, D. M. J.; Duin, R. P. W. Pattern Recognit. Lett. 1999, 20, 1119–1199.
(10) Tax, D. M. J.; Duin, R. P. W. Mach. Learn. 2004, 54, 45–66.
(11) Kittiwachana, S.; Ferreira, D. L. S.; Lloyd, G. R.; Fido, L. A.; Thompson, D. R.; Escott, R. E. A.; Brereton, R. G. J. Chemom. 2010, 24, 96–110.
(12) Brereton, R. G.; Lloyd, G. R. Analyst 2010, 135, 230–267.
(13) Kohonen, T. Construction of Similarity Diagrams for Phenomes by a Self-Organising Algorithm; Helsinki University of Technology: Espoo, Finland, 1981.
(14) Kohonen, T. Biol. Cybernetics 1982, 43, 59–69.
(15) Kohonen, T. Self-Organizing Maps; Springer: New York, 2001.
These approaches are much more computationally intensive than the traditional statistical methods and would have been inconceivable to run online in the past. However, at the time of writing, with the rapid increase in computing power, these calculations can be performed online using contemporary scientific desktop computers. Furthermore, methods reported now may find practical use several years later, when computers will be even more powerful. Hence it is opportune to investigate the potential of such approaches for process monitoring as a precursor for a new generation of methods for MSPC. The potential of SVDD for process monitoring has been reported previously,11 and it was shown to overcome many of the limitations of traditional approaches, including no longer requiring the NOC region to be multinormal. In chemistry, SOMs, which are also a form of machine learning, have recently been applied to pattern recognition,16-20 structure-activity relationship studies,21-23 calibration,24,25 and process monitoring.26-28 They have many of the advantages of SVDD and, in addition, allow insight into which variables are important, that is, which chemicals or regions of a spectrum are influential in the model. PCA-based methods also offer this insight but are linear least-squares approaches, so SOMs combine the advantages of PCA with those of SVDD. For process monitoring, most applications determine where a sample lies on a map, so that, for example, the provenance of a test sample can be assessed.26-28 However, an alternative is to determine how well a sample fits into a map; this can be done by measuring its distance from the map. If, for example, there are 100 units (or cells) in a map, the distance or dissimilarity from each unit in the map can be determined. The greater these distances, the less likely the sample is to fit into the group of samples used to form the map. In this way, SOMs can be used as one-class classifiers, the map consisting of samples from just one class (the NOC samples), with a cutoff distance (or confidence) defined to indicate whether a sample is a member of the NOC group at a defined probability. In this paper, a new approach of using SOMs as a one-class classifier is reported. It can be used to determine how well a test sample fits the NOC region, and the method is exemplified by application to online high-performance liquid chromatography (HPLC) of a continuous process. For brevity, a single NOC region is chosen, using an already reported criterion,11,29 although, of course, any method could be used to define such a region.
(16) Lloyd, G. R.; Brereton, R. G.; Duncan, J. C. Analyst 2008, 133, 1046–1059.
(17) Marini, F.; Zupan, J.; Magri, A. L. Anal. Chim. Acta 2005, 544, 306–314.
(18) Melssen, W.; Wehrens, R.; Buydens, L. Chemom. Intell. Lab. Syst. 2006, 83, 99–113.
(19) Latino, D. A. R. S.; Aires-de-Sousa, J. Anal. Chem. 2007, 79, 854–862.
(20) Wongravee, K.; Lloyd, G. R.; Silwood, C. J.; Grootveld, M.; Brereton, R. G. Anal. Chem. 2010, 82, 628–638.
(21) Kawakami, J.; Hoshi, K.; Ishiyama, A.; Miyagishima, S.; Sato, K. Chem. Pharm. Bull. 2004, 52, 751–755.
(22) Zhang, Q.-Y.; Aires-de-Sousa, J. J. Chem. Inf. Model. 2005, 45, 1775–1783.
(23) Guha, R.; Serra, J. R.; Jurs, P. C. J. Mol. Graphics Modell. 2004, 23, 1–14.
(24) Marini, F.; Magri, A. L.; Bucci, R.; Magri, A. D. Anal. Chim. Acta 2007, 599, 232–240.
(25) Melssen, W.; Ustun, B.; Buydens, L. Chemom. Intell. Lab. Syst. 2007, 86, 102–120.
(26) Jamsa-Jounela, S. L.; Vermasvuori, M.; Enden, P.; Haavisto, S. Control Eng. Pract. 2003, 11, 83–92.
(27) Diaz, I.; Dominguez, M.; Cuadrado, A. A.; Fuertes, J. J. Expert Syst. Appl. 2008, 34, 2953–2965.
(28) Kampjarvi, P.; Sourander, M.; Komulainen, T.; Vatanski, N.; Nikus, M.; Jamsa-Jounela, S. L. Control Eng. Pract. 2008, 16, 1–13.
The results are compared with some more established methods such as the D and Q statistics and SVDD.

EXPERIMENTAL DATA
This work is illustrated by the monitoring of a continuous process using online HPLC. The data set used in this paper has been published previously;11 therefore, only a summary is provided in this section for brevity. More details of the process and the HPLC conditions have been described elsewhere.29-31 The reaction was monitored over 105.11 h. The first sample was taken from the process 1.13 h after the reaction started. Subsequent samples were taken from the process at 5-21 min intervals, resulting in 309 samples overall. Each sample was analyzed chromatographically using a single-wavelength ultraviolet detector at 245 nm over an elution time of 2.5 min at a sampling rate of 13.72 scans/s, resulting in 2058 data points per chromatogram. Chromatographic data were exported to Matlab version R2008b (Mathworks, Natick, MA) for further analysis. All software was written in-house in Matlab. A data matrix of 309 rows, corresponding to the process times, and 2058 columns, corresponding to the scans, was obtained.

METHODS
It is important that any method developed can be applied dynamically to analyze online data rather than being restricted to a retrospective assessment. Therefore, to emulate real-time data acquisition, the chromatograms were sequentially input from the first to the last with only historic information available to the algorithms. Hence when analyzing, for example, chromatograms recorded over 50 h, it is assumed that no information about future samples is available, so that the methods reported in this paper can be implemented in real time online.

Data Preparation. Although it is possible to analyze the data using baseline-corrected and peak-aligned chromatograms (total chromatographic intensity),29,30 in this paper, the data are analyzed in the form of a peak table whose rows consist of samples and whose columns consist of chromatographic peaks. The steps in data preparation, including the parameters used, have been described in greater detail elsewhere and are not repeated in this paper for brevity.29 The cumulative relative standard deviation (CRSD)11,29 was used to determine which group of samples (or "window") exhibits low variation and so could be defined as the NOC region. Once this NOC region containing Nc samples is found, chromatographic peaks are aligned according to a reference chromatogram found in this region, as discussed previously.29 Afterward, peak picking (to determine how many detectable peaks there are and what their elution times are) and peak matching (to determine which peaks in each chromatogram or sample correspond to the same compound) are performed to obtain a peak table. Prior to data analysis, the elements of the peak table are square-root-transformed to reduce heteroscedastic noise.32
(29) Kittiwachana, S.; Ferreira, D. L. S.; Fido, L. A.; Thompson, D. R.; Escott, R. E. A.; Brereton, R. G. J. Chromatogr., A 2008, 1213, 130–144.
(30) Zhu, L.; Brereton, R. G.; Thompson, D. R.; Hopkins, P. L.; Escott, R. E. A. Anal. Chim. Acta 2007, 584, 370–378.
(31) Ferreira, D. L. S.; Kittiwachana, S.; Fido, L. A.; Thompson, D. R.; Escott, R. E. A.; Brereton, R. G. Analyst 2009, 134, 1571–1585.
This transformation can be shown to be equivalent to a Box-Cox transformation with λ = 0.5 on standardized data. Additionally, this square-root transformation reduces the undue influence of large peaks on the signal. Then, each row of the peak table is scaled to a constant value of 1. After that, the peak table is standardized to ensure that each peak has a similar influence on the model.

Process Monitoring Methods. Although each of the methods described in this paper uses different criteria for determining whether a sample is in control, they are compared using the same reference (or NOC) samples for modeling. The same data preprocessing is used for all the methods, and the heteroscedastic noise, which is quite injurious to PCA although SOMs can withstand it, is removed, so there is no bias from the data preprocessing employed. The methods used in this paper can be categorized into two groups: parametric methods, such as the D and Q statistics, that assume the data arise from a multinormal distribution, and nonparametric methods, such as SVDD and SOMs, that make no assumptions about the distribution of the data. In this paper, the D and Q statistics and SVDD are applied to disjoint PC models33 of the preprocessed peak table of the NOC region, whereas SOMs are applied to the preprocessed peak table of the NOC region without prior PCA. The test samples are preprocessed using the mean and the standard deviation of the NOC samples.

D and Q Statistics. D and Q statistics are commonly used methods for process monitoring.1-3,5-7 The distinction between these two methods is that they assess different aspects of the variation from the NOC region, which can be modeled using PCA by X = TP + E, where T represents the scores and P the loadings. The variation in the NOC samples X can be divided into two parts: a systematic variation part TP and a residual part E which is not described by the model.7 The D statistic assesses the systematic variation in the NOC region. In contrast, the Q statistic looks at the residuals from the PC model of the NOC region and sees how large the nonsystematic variation is. One of the most important considerations is to ensure that the NOC model has an appropriate number of PCs. In this study, cross-validation7,34 was used to determine the optimal number of PCs used to describe the NOC region. An alternative technique, the bootstrap,35 can also be employed; however, it is not the prime aim of this paper to compare different approaches for determining the number of PCs. In fact, for the data used in this paper, both methods give the same number of PCs for the model; moreover, the proposed SOM method has the advantage that there is no need to optimize the number of PCs at all. A sample is classified as in control if its D or Q value falls within a predefined control limit at a given confidence level. The description of how to calculate the D and Q values and their confidence limits is presented elsewhere.1,2,5-7 In this study, the value of the relevant statistic is calculated for each sample and visualized together with the critical values in a bar chart. The 90% control limits for the D and Q statistics are used in this paper, and the bar graphs are plotted on a logarithmic scale, although the work in this paper could be generalized to any control limit.
(32) Kvalheim, O. M.; Brakstad, F.; Liang, Y.-Z. Anal. Chem. 1994, 66, 43–51.
(33) Li, D.; Lloyd, G. R.; Duncan, J. C.; Brereton, R. G. J. Chemom. 2010, 24, 273–287.
(34) Wold, S. Technometrics 1978, 20, 397–405.
(35) Efron, B.; Tibshirani, R. J. An Introduction to the Bootstrap; Chapman & Hall/CRC: Florida, 1993.
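As an illustration of the D and Q statistics described above, the following is a minimal Matlab sketch for a single test sample, assuming a disjoint PC model built only from the NOC peak table; the variable names (Xnoc, xtest, A) are illustrative and not taken from the authors' software, and control limits are not computed here.

% Hypothetical sketch: D (Mahalanobis distance in PC space) and Q (SPE) for
% one test sample, based on a PC model of the NOC samples only.
% Xnoc (Nc x J) is the NOC peak table, xtest (1 x J) a test sample.
mu = mean(Xnoc); sd = std(Xnoc);
Z = (Xnoc - repmat(mu, size(Xnoc,1), 1)) ./ repmat(sd, size(Xnoc,1), 1);
[U, S, P] = svd(Z, 'econ');          % PCA of the standardized NOC region via SVD
A = 2;                               % number of PCs (two for this data set)
T = U(:, 1:A) * S(1:A, 1:A);         % NOC scores
lam = var(T);                        % variance of each retained PC
z = (xtest - mu) ./ sd;              % preprocess test sample with NOC statistics
t = z * P(:, 1:A);                   % project onto the NOC loadings
D = sum((t.^2) ./ lam);              % D statistic: Mahalanobis distance in PC space
e = z - t * P(:, 1:A)';              % residual not described by the model
Q = sum(e.^2);                       % Q statistic: squared prediction error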
Support Vector Domain Description. Support vector domain description7-10,12 is a modified version of support vector machines (SVMs)12,36-38 and can be described as a one-class classifier,8-11 requiring only one group to define the model. SVDD defines a boundary around the NOC samples in the form of a hypersphere in kernel space, and this boundary can be used as a confidence limit for the model. The model does not require that the data follow a multivariate normal distribution, and the boundary, when projected into data space, can be nonellipsoidal, although it is necessary to use a kernel for this purpose. The mathematical basis of SVDD is well described elsewhere.7-10,12 In this paper, a radial basis function (RBF) was used for the kernel, since it requires only one parameter to define it.12,36 Two parameters control the SVDD model, the sample rejection rate D7,8 and the kernel width σ for the RBF. In this paper, the D parameter used for the modeling is fixed at 0.1, corresponding to the 90% confidence limit used for the D and Q statistics, to allow comparison. How to optimize σ is described elsewhere;11,12 it involves finding a compromise solution that attempts to minimize the proportion of test set samples (generated using the bootstrap) rejected as belonging to the NOC (or in-control samples) while also minimizing the radius of the hypersphere that surrounds the bootstrap training set in kernel space, since the smaller the radius of the hypersphere, the closer it fits to the training set samples. The distance of a sample in kernel space from the center of the model can be calculated and compared to the distance given by a boundary characterized by D = 0.1. If it is below this distance, the sample is classified as in control with 90% confidence. Although the model is bounded by a hypersphere in kernel space, so that the distance from the center in this space is directly related to the probability that a sample is a member of the group, the data are no longer bounded by a hypersphere when projected back into data space and are consequently enclosed within irregularly shaped regions. Hence, when projected back into data space, boundaries for different values of D (or confidence limits) are not necessarily concentric12 or of the same shape. An analogy to the D statistic can be obtained using SVDD, but because the confidence boundaries in data space are not necessarily concentric or of the same shape for different values of D, the SVDD charts differ in appearance according to the value of D, unlike the other measures described in this paper, where the only difference is the position of the cutoff limit.

Self-Organizing Maps. Self-organizing maps were first described by Kohonen in the 1980s as a method of visualizing the relationship between different speech patterns.13,14 SOMs can be considered as an unsupervised method, and an alternative to PCA, since they can be used to present the characteristic structure of data in a low-dimensional display. However, they can also be modified for use in a supervised mode.20,39
(36) Xu, Y.; Zomer, S.; Brereton, R. G. Crit. Rev. Anal. Chem. 2006, 36, 177–188.
(37) Vapnik, V. N. The Nature of Statistical Learning Theory; Springer: New York, 2000.
(38) Cortes, C.; Vapnik, V. Mach. Learn. 1995, 20, 273–295.
Figure 1. Q × R map with J weights containing a total of K map units and the corresponding weight matrix W.

SOMs are neural networks that employ adaptive learning algorithms, and so are well suited to many real systems.40 A SOM involves a map, often represented by a two-dimensional grid consisting of a certain number of units, and the aim is to locate the samples on the map such that the relative distances between them in the original data space are preserved as far as possible within this lower dimensional map. Other geometries such as circles or spheres can also be envisaged,15 but this paper uses the most widespread, rectangular representation. In process monitoring, it is difficult to generate a sufficiently representative class of abnormal samples, so that only one group of (in control) samples is available for modeling. Therefore, in this paper, SOMs are trained on only one class of samples, the NOC samples, analogous to one-class classifiers. After that, a confidence limit is estimated using the jackknife.41

Training. The full SOM algorithm is described elsewhere,12-16 and only the essential steps are described in this section. In this paper, a trained map consists of a total of K (= Q × R) map units (usually squares or hexagons), where Q and R are the number of rows and columns of the map, respectively. Each map unit k is characterized by a weight for each variable, resulting in a 1 × J weight vector wk (Figure 1), where J corresponds to the number of variables. Therefore, each variable j has a K × 1 vector of weights which, for this work, were generated from randomly selected values from a uniform distribution within the measured range of variable j. In the training process, a sample vector (randomly selected from the training samples) is compared to each of the map unit weight vectors. The dissimilarity between a randomly selected sample xz and the weight vector for map unit k is given by

s(xz,wk) = ∑j=1,...,J (xzj − wkj)²

The map unit with the most similar weight vector is declared as the "winner" or the best matching unit (BMU), defined as follows:

s(xz,wb) = mink{s(xz,wk)}

where b is the value of k with the most similar weight vector wk to the randomly selected sample xz:

b = arg mink{s(xz,wk)}
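The dissimilarity calculation and BMU search just defined can be written compactly; the following is a minimal Matlab sketch under the assumption that the weight vectors are stored as rows of a matrix W, with the variable names xz and W being illustrative rather than taken from the authors' software.

% Hypothetical sketch: squared Euclidean dissimilarity between a sample
% xz (1 x J) and all K map unit weight vectors stored as rows of W (K x J),
% followed by identification of the best matching unit (BMU).
s = sum((W - repmat(xz, size(W, 1), 1)).^2, 2);  % K x 1 vector of dissimilarities
[smin, b] = min(s);                              % b is the index of the BMU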
(39) Lloyd, G. R.; Wongravee, K.; Silwood, C. J. L.; Grootveld, M.; Brereton, R. G. Chemom. Intell. Lab. Syst. 2009, 98, 149–161.
(40) Mao, J. C.; Jain, A. K. IEEE Trans. Neural Networks 1995, 6, 296–317.
(41) Wolter, K. M. Introduction to Variance Estimation; Springer: New York, 2007.
This BMU and its neighboring map units are then updated to become more like the selected sample. The amount that the units can "learn" to represent the input sample is controlled by the learning rate ω and the neighborhood weight β. As the learning proceeds, the samples gradually move toward the region of the map that they are most similar to, and so samples that are close together in the high-dimensional input space are mapped to SOM units that are close together in the map space. If the map space is two-dimensional, there are a number of ways of visualizing the relationship between samples, such as the U-matrix,42 hit histograms,43 and supervised color shading.16 For this work, SOM maps of different sizes (10 × 5, 10 × 15, and 20 × 30 units) were trained using 10 000 learning iterations, an initial learning rate of 0.1, and an initial neighborhood width of half of the smallest dimension of the map, as described previously.16 In each map, the quantization error (QE) for each sample i can be defined as the dissimilarity between its corresponding vector xi and its BMU. The mean quantization error (MQE)15,16 is defined as the mean of the QEs for all samples in the training set for a given map. Since the initial weight vectors are generated randomly, the map will appear different each time it is generated. In this paper, five maps of each size were generated, and the maps with the lowest MQE are used for the analysis and presented in this paper. Naturally, there are many possible variants, but this approach is used because it provides a simple, demonstrable protocol.

SOM Quality Control Index. A summary of the steps used to obtain the SOM quality control (SOMQC) index is presented in Figure 2 and described in detail below. The first step is to determine the dissimilarity and SOM distance map for each sample. Once an SOM has been trained, the dissimilarity between a sample xi and the weight vector for each map unit k, defined by s(xi,wk), can be calculated and organized into a distance map Si of size Q × R corresponding to the K grid units in the trained map (Figure 3). This map can be used to review how different a test sample is compared to the NOC samples and also which NOC samples are the most similar or dissimilar to the test sample. Fault detection can be based on the distance of an unknown sample from the trained map. When a test sample is too far (too different) from the trained map, this might be due to a new feature which could be the consequence of a faulty process. In order to identify an out of control sample, the nth percentile of the ranked distances of sample i to all units in the trained map is calculated, denoted vin (hence, if n = 0, the smallest distance from the map is used, whereas if n = 50, the median is used); this is defined as the SOMQC index. If this value exceeds a predefined confidence limit, as described below, the process is considered to be out of control. The SOMQC index is different from the
(42) Ultsch, A.; Siemon, H. P. Proceedings of the INNC'90 International Neural Network Conference; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1990.
(43) Vesanto, J. Intell. Data Anal. 1999, 3, 111–126.
Figure 2. Overview of the SOMQC for process monitoring as illustrated by an SOM trained using four NOC samples (in practice there will be many more samples) and the n = 50th percentile.
Figure 3. Q × R trained map and Q × R distance map Si calculated from a sample vector xi.
quantization error, since the index depends on the chosen nth percentile, whereas the quantization error is the minimum distance of a sample vector to the map, equivalent to the SOMQC index when n = 0.
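A minimal Matlab sketch of the SOMQC index for a single sample is given below; it assumes the trained weight matrix W (K × J) is available and uses a simple empirical percentile, since the exact percentile convention is not specified here, and the variable names are illustrative rather than the authors' implementation.

% Hypothetical sketch: SOMQC index of a sample xi (1 x J) as the nth
% percentile of its dissimilarities to all K map units (n = 0 recovers the
% quantization error; n = 50 uses the median distance).
s = sum((W - repmat(xi, size(W, 1), 1)).^2, 2);  % distance map as a K x 1 vector
ssort = sort(s);
K = numel(ssort);
vin = ssort(max(1, ceil(n/100 * K)));            % simple empirical nth percentile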
The next step is to calculate the confidence limit. In this paper, the jackknife41 is used to calculate the confidence limit for the SOMQC index. The jackknife method as applied in this paper is similar to leave-one-out cross-validation, but it is employed to define a confidence limit rather than to validate or optimize a method. Each sample is removed in turn, and the remaining samples are used to train the SOM model. This model is used to predict the distances of the left-out sample to all of the map units based on the remaining samples. The percentile distance, calculated using the same nth percentile as defined above, is then retained for this sample. The procedure is repeated for each of the NOC samples left out. After that, the Hodges-Lehmann (HL) method44,45 is used to obtain a null distribution from the distances obtained above, and the confidence limit can be estimated from the null distribution using the steps below. The
(44) Alloway, J. A.; Raghavachari, M. J. Qual. Technol. 1991, 23, 336–347.
(45) Das, N. Int. J. Adv. Manuf. Technol. 2009, 41, 799–807.
Figure 4. (a) D statistic 90% confidence limit, (b) SVDD D = 0.1 (equivalent to a 90% confidence limit), and (c) Q statistic 90% confidence limit, represented by a plane. For panel c, only NOC samples are shown for clarity, and all the models use projections onto two PCs of the NOC samples (disjoint PC models).
reason for the HL method is to obtain a finer distribution of distances; for example, if there are 10 NOC samples, the HL method will result in (10 × 11)/2 = 55 measures rather than 10. The method is as follows.

1. Define a distribution characterized by M = Nc(Nc + 1)/2 numbers. Each of the numbers equals the pairwise average of the indices of samples i and l, namely, HLm = (vin + vln)/2, where m = 1, ..., M and i ≤ l; i = 1, ..., Nc and l = i, ..., Nc. Here, Nc is the number of NOC samples, and vin and vln are the SOMQC indices, using the nth percentile, of samples i and l, respectively, based on their trained maps.

2. Calculate the confidence limit of the null distribution, where the α% confidence limit is the αth percentile of the null distribution of the M pairwise average distances. This paper uses a one-tailed 90% confidence limit.

The SOMQC index can be calculated for any sample, whether or not it is part of the NOC region, and visualized together with the confidence limit obtained from the null distribution using the HL method, as described above, in the form of a bar chart.

Component Planes. It is possible to display component planes15,16 of the trained map for each variable separately, which are somewhat analogous to loadings in PCA. Component planes can be used to see how each variable influences the map and which samples a variable is most associated with. In this paper, the
component planes were scaled in the range from 0 to 1 so that each variable can be compared on a similar scale whatever its original range.
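Before moving to the results, the jackknife and HL confidence-limit calculation described above can be summarized in a minimal Matlab sketch; it assumes v is the Nc × 1 vector of jackknife SOMQC indices (one per left-out NOC sample, each obtained from a map trained on the remaining samples), and the variable names and percentile convention are illustrative rather than the authors' implementation.

% Hypothetical sketch: Hodges-Lehmann null distribution and one-tailed
% confidence limit from the jackknife SOMQC indices v (Nc x 1).
Nc = numel(v);
HL = zeros(1, Nc*(Nc+1)/2);       % M pairwise averages, i <= l
m = 0;
for i = 1:Nc
    for l = i:Nc
        m = m + 1;
        HL(m) = (v(i) + v(l))/2;  % average of the two SOMQC indices
    end
end
HL = sort(HL);
alpha = 90;                       % one-tailed 90% limit used in this paper
limit = HL(ceil(alpha/100 * numel(HL)));

A test sample would then be flagged as out of control if its SOMQC index exceeds this limit.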
RESULTS AND DISCUSSION
NOC Region. The data used in this paper were from online HPLC of a continuous process monitored over 105.11 h. The overlaid plots of the chromatograms of all 309 samples after baseline correction and peak alignment are shown in Supporting Information Figure S-1. An overview of the peak table obtained from these data after the final sample was obtained is presented in Supporting Information Figure S-2. In this work, only the CRSD was used as a criterion for the NOC region.11,29 From the log book, samples early in the process that were recorded with known detector faults were excluded. As a result, using a window containing 30 samples and 7% CRSD as a threshold, samples 28 (8.98 h) to 59 (21.02 h) were defined as the NOC region, excluding samples 38 and 50, which had known detector faults. It is not the prime aim of this paper to compare approaches for determining the NOC samples, but the choice made in this paper is typical of real-time process decisions: the NOC samples are early in the process, consist of a sufficiently large number of samples, and exclude samples with known detector faults. The distribution of samples in this region is not described by a multinormal distribution, as it fails the Doornik-Hansen omnibus multinormality test46 at the 95% confidence level using two PC models.
Figure 5. U-matrix visualizations for the trained map consisting of the 30 NOC samples where the numbers represent the BMU of each sample.
Most methods for MSPC assume multinormality in the NOC region, but it may require long process runs before such conditions are found, if at all, and so methods for MSPC should be able to provide process information even if the samples in the NOC region are not normally distributed. In this paper, a locked peak table11,29 is used, which means that the number of peaks considered during analysis was fixed according to the number of peaks detected in the NOC region, and new peaks detected after the NOC region were not considered as part of the model. Since the peaks from the mobile phase did not provide any information about the process, they were also removed from the model. A total of 12 peaks were used for monitoring the reaction. All samples after the NOC region, whether they corresponded to known detector faults or to suspected process faults, were assessed by the model. After the data preprocessing was performed as discussed in the Data Preparation section, and once the last sample was obtained from the process, a data matrix of dimensions 309 × 12 was used for analysis, where each row corresponds to a sample obtained as the process evolved and each column corresponds to the intensity of a detected chromatographic peak.

Optimization of PC and SVDD Models. The optimum number of PCs for the defined NOC region was determined using cross-validation.7,34
(46) Doornik, J. A.; Hansen, H. Bull. Oxford Univ. Inst. Econ. Stat. 2008, 70, 927–939.
It was found that two PCs can be used to describe the NOC samples. The optimum σ value for SVDD, using the bootstrap with 100 repetitions in the NOC region, was found to be 2.52; here, the data have been standardized prior to PCA.

NOC Region in the Space of the First Two PCs. Parts a-c of Figure 4 show the score plots of PC2 against PC1 for the D and Q statistics with 90% confidence limits and for SVDD with D = 0.1. In this paper, all the models use disjoint projections onto two PCs of the NOC space. From the PC space, it can be seen that there are two groups of NOC samples, a larger one containing the majority, on the left of the plot, and a smaller one containing samples 34, 39, 40, and 51, on the right of the plot. Using the D statistic, the samples in the smaller group are excluded from the model and classified as out of control at 90% confidence, whereas using SVDD, the boundary is more flexible and therefore extends to cover the smaller group. Since 90% control limits are used, it is anticipated that around three samples will be misclassified. In contrast, the Q statistic provides different results from the D statistic because it considers the data from a different point of view; the Q statistic models the Euclidean distances from the PC space, but the D statistic models the Mahalanobis distances in the PC space.

SOM for Process Monitoring. Trained Maps. SOMs of sizes 10 × 5, 10 × 15, and 20 × 30 units were trained using the preprocessed peak intensities of the NOC region.
Figure 6. The 10 × 15 map unit component planes for each variable (chromatographic peak) where the BMU for each of the NOC samples is labeled, corresponding to the 10 × 15 map of Figure 5.
In this paper, the trained maps are visualized using a U-matrix, in which the similarity between a unit and its neighbors is represented. A low value in the U-matrix implies similar neighbors (e.g., in the middle of a cluster of map units), whereas a high value represents parts of the map where there are dissimilar neighbors (e.g., on the boundaries between different clusters or between an outlier and the main sample cluster).
These visualizations are shown in Figure 5. A higher resolution describes the data in more detail, as there are more interpolation units (units that are not the BMU of any training sample and represent the transition between adjacent units), but training takes longer.
Figure 7. Plots of the numbers of out-of-control samples from the SOM model using a 90% confidence limit against the percentiles used to calculate the SOMQC index ranging from 0 to the 100th percentile.
From the U-matrices, four samples (nos. 34, 39, 40, and 51) are clustered separately from the rest (in the top right of the 10 × 15 map), as the boundary between these samples and the others is high. This corresponds to the out of control samples in Figure 4, suggesting that SOMs are also able to pick up outliers. Since these SOM maps are formed from the NOC region, it is unlikely that there will be very strong outliers, although there will be a few samples that are on the extremes of the distribution.

Component Planes. Component planes can be used to visualize where variables have been mapped onto the SOM, and labeling allows us to see which samples are best associated with these variables. Component planes for each variable using the 10 × 15 trained map are presented in Figure 6. Similar results are obtained from the maps of different sizes. The 10 × 15 trained map is chosen as a compromise between training time and map resolution. From the component planes, peaks 3, 4, and 5 appear to be correlated (although peak 3 is inversely correlated to peaks 4 and 5). Peak 3 is the only peak not detected in all samples and is missing in samples 34, 39, 40, and 51, so the component plane is heavily weighted toward the remaining samples. It is not anticipated that there will be many strong trends for peaks in the NOC region, although component planes do illustrate whether there is a relationship between peak intensities and samples. If there is not much variation, the component planes will be quite flat.

SOM Chart. In order to investigate how different a sample is from the trained map, the SOMQC index is calculated. SOM models can be constructed using different values of n. As an illustration of the influence of n, a graph of the number of samples from the process found to be out of control, using a 90% control limit and the nth percentile ranging from 0 to 100, separated into samples where peak 3 is and is not detected, is presented in Figure 7. To interpret this result, it is important to remember that, in the NOC region, there is a group of four extreme samples where peak 3 is not detected, but this is not considered particularly abnormal.
Figure 8. D, Q, SVDD, and SOMQC charts.
Employing higher percentiles uses similarities to more extreme NOC samples as part of the assessment criterion, and as a result, the number of samples judged to be out of control without peak 3 detected decreases, with a reverse trend for samples containing peak 3. Therefore, the sensitivity of the SOM model can be tuned. A high value of n makes the method for process control less sensitive to dissimilarities from the extreme NOC samples. However, it is not recommended to use values of n that are too large or too small. When low values of n are used, samples must be particularly similar to the average NOC samples to be classified as in control, whereas, when large values of n are used, extreme features in the NOC region might influence the model too much, which can be especially problematic if the NOC distribution is asymmetric. Variation in n is less influential when the data are multinormal. Below, results with n = 50 are reported, as a compromise.

A SOM chart for the overall process is shown in Figure 8. The chart was compared with the results using the D and Q statistics and SVDD at a 90% confidence limit. It can be seen that the D statistic and SVDD show similar trends. This could be because both methods analyze the variation in the PC space. The difference in results could be due to boundary shapes: SVDD is more flexible and is fitted to the overall data, whereas the D statistic boundary is primarily based on the density of the bigger group. In contrast, the Q statistic and SOMQC are expected to share similar patterns because both focus on the Euclidean distance of a sample from the model space (the Q statistic looks at the Euclidean distance between a sample and the PC space, whereas the SOM looks at the Euclidean distance between a sample and the trained map).
Figure 9. SOM distance maps for (a) sample 31 (NOC), (b) sample 61 (in control), (c) sample 81 (out of control), and (d) sample 93 (in control). The SOM 90% control limit is represented by a plane, and the distances to each sample in the NOC SOM map are represented by a grid.
The Q statistic shows most samples to be out of control, probably because of overfitting. It would normally be desirable to use a higher confidence level or to adjust the NOC region as the process progresses. This is typical of the behavior of the Q statistic. In contrast, the D statistic often underfits, accepting too many samples that are out of control. The SOMQC test appears to be a compromise between the Q and D statistics: many failed samples are close to the 90% control limit. For this paper, a 90% control limit is chosen because this clearly shows the difference in performance of the methods. In practice, a higher limit might be employed, and this can be adjusted based on the process engineer's experience. In addition, the performance of SOMQC can be changed by the tunable parameters employed. It is possible to visualize how a test sample behaves relative to the NOC region by plotting a contour plot of its distance map on top of the U-matrix. The contour plots of the distance maps for samples 31 (NOC), 61 (in control), 81 (out of control), and 93 (in control) are presented in Figure 9. As expected, sample 31, which is one of the NOC samples, is the most similar to its BMU. Most of the distances for samples 61 and 93 are under the threshold; therefore, they are considered in control. Sample 81 is out of control. However, it can be seen from the distance map profile that this sample is most similar to the extreme samples in the small cluster of the NOC samples, and it was also found that peak 3 was not detected in the chromatogram of sample 81.
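As an illustration of this kind of display, a minimal Matlab sketch is given below; it assumes the U-matrix Umat (Q × R) and the K × 1 distance vector s for a sample are available, and the reshape orientation of the grid is an assumption of this sketch rather than a detail taken from the authors' software.

% Hypothetical sketch: overlay a sample's distance map on the U-matrix.
Si = reshape(s, Q, R);            % distance map arranged on the Q x R grid
imagesc(Umat); colormap(gray);    % U-matrix as the background image
hold on
contour(Si, 10, 'LineWidth', 1);  % contour lines of the distance map
hold off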
CONCLUSION
SOMs, although common in many areas of science, have as yet had limited application in analytical chemistry and are not well established in the area of MSPC. In this paper, a new approach that allows SOMs to be used quantitatively to determine whether a sample is in control at a given confidence level is described. SOMs have significant advantages if the NOC region is not normally distributed: the assumption of multinormality is not always obeyed or tested for, and for a continuous process, it may take time before such a region is found. Often process operators can assess whether they find a group of samples acceptable, and this usually does not require normality. The SOMQC index measures the distance or dissimilarity of samples from a map obtained from the NOC samples, rather analogously to the Q statistic, but appears to have fewer problems with overfitting. Furthermore, it is not necessary to determine the optimum number of PCs for disjoint modeling of these samples, because SOMs can be computed directly on the raw data. The SOMQC index can be tuned according to whether the user wishes the extreme NOC samples to be included in the assessment or not, but it is recommended to use an intermediate pathway as the usual default. SOMQC appears to have a performance midway between the D statistic (which is often prone to underfitting, with too many samples accepted) and the Q statistic. Unlike SVDD, it is possible to visualize the influence of variables on the NOC model, which can provide chemical insight. The methods in this paper are illustrated by a long, thin data set, with many fewer variables than samples, but they are applicable to situations where the number of variables exceeds the number of samples, subject to limitations in computing power.
It is anticipated that over the next few years, with the rapid increase in computing power, there will be flexibility to re-evaluate the traditional statistical approaches for process monitoring and to employ new methods, side by side, that are more computationally intensive but originate primarily from pattern recognition.
ACKNOWLEDGMENT
This work was supported by GlaxoSmithKline (D.L.S.F.) and the Strategic Scholarships Fellowships Frontier Research Networks from the Commission on Higher Education, Government of Thailand (S.K.).

Received for review February 10, 2010. Accepted May 28, 2010.
SUPPORTING INFORMATION AVAILABLE Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.
AC100383G