Article

Consensus Outlier Detection Using Sum of Ranking Differences of Common and New Outlier Measures without Tuning Parameter Selections. Brett Brownfield and John H. Kalivas. Anal. Chem., Just Accepted Manuscript. DOI: 10.1021/acs.analchem.7b00637. Publication Date (Web): April 3, 2017.




Consensus Outlier Detection Using Sum of Ranking Differences of Common and New Outlier Measures Without Tuning Parameter Selections

Brett Brownfield† and John H. Kalivas†*
†Department of Chemistry, Idaho State University, Pocatello, Idaho 83209, United States
*[email protected]

ABSTRACT: Sample outlier detection is imperative before calculating a multivariate calibration model. Outliers, especially in high-dimensional space, can be difficult to detect. The outlier measures Hotelling's t-squared, Q-residuals, and Studentized residuals are standard in analytical chemistry with spectroscopic data. However, these and other merits are tuning parameter dependent and sensitive to the outliers themselves, i.e., the measures are susceptible to swamping and masking. Additionally, different samples are often flagged as outliers depending on the outlier measure used. Sum of ranking differences (SRD) is a new generic fusion tool that can simultaneously evaluate multiple outlier measures across windows of tuning parameter values, thereby simplifying and improving outlier detection. Presented in this paper is SRD to detect multiple outliers despite the effects of masking and swamping. Both spectral (x-outlier) and analyte (y-outlier) outliers can be detected separately or in tandem with SRD using the respective merits. Unique to SRD is a fusion verification process to confirm samples flagged as outliers. The SRD process also allows for sample masking checks. Presented, and used by SRD, are several new outlier detection measures. These measures include atypical uses of Procrustes analysis and extended inverted signal correction (EISC). The methodologies are demonstrated on two near infrared (NIR) data sets.

The key to forming a multivariate calibration model for quantitative analysis is outlier detection. Specifically, if the sample set for forming the calibration model has outliers present, then the calibration model will be useless.1 Working with outliers is a long-standing problem1-12 and well reviewed.9-12 Two approaches are common for assessing outliers. One involves detection and removal of outliers1-5 and the other makes use of robust modeling methods6-9 to reduce outlier influences. Discussed in this paper are outlier detection and removal. A general method to test for the presence of influential samples is available, thereby allowing one to decide whether to identify the samples or apply robust methods.13 It has recently been proposed that characterizing a single sample as an outlier by one measurement is not adequate. Instead, a method was developed for creating a sample-wise distribution for the chosen outlier measure by using a cross-validation (CV) process to form sample subsets followed by calculation of the outlier measure.10,11,14 While not used in this study, the CV process can be incorporated to further enhance the novel outlier detection strategy proposed in this paper. The two most commonly used spectral outlier measures (x-outlier) in analytical chemistry are Hotelling's t-squared (closely associated with the Mahalanobis distance (MD)) and Q-residuals. The most frequently used outlier measure for analyte prediction (y-outlier) is the Studentized residual. However, these and other merits require optimization of a tuning parameter, a confounding problem with no accepted solution. Complicating the tuning parameter selection obstacle is deciding which outlier merit to use. In a study comparing six

outlier measures, it was observed that the effectiveness of each measure in correctly identifying outliers depended on several factors.3 Some issues identified are the actual sample distributions relative to the measured variables, the type of outliers, and the number of outliers. It was recommended that a large number of measures be used, but how to effectively leverage this information was not considered. Graphical analysis of outlier measures was suggested over setting statistical thresholds based on assumed underlying probability distributions. In analytical chemistry, a common graphical assessment is to plot MD against the Q-residual (both at fixed tuning parameter values), forming an outlier detection map to identify outlier samples based on selected thresholds or confidence levels. To solve the double problem of which measures to use and the respective tuning parameter selection, proposed in this paper is fusion of multiple x- and y-outlier measures over windows of tuning parameter values. For example, the tuning parameter for MD is the number of eigenvectors. In this case, a collection of MD outlier values across windows of tuning parameter values is obtained: windows of MDs based on eigenvector 1, eigenvectors 1 and 2, eigenvectors 1 through 3, and so on up to a stopping point. With fusion, a consensus of whether a sample is an outlier is obtained, thereby reducing the chance of missing an abnormal sample. Because a sample set often contains multiple outliers, the outlier detection problems of masking and swamping can occur.9,12,15 Masking arises when outliers cause other outliers to appear as normal samples. When such outliers are removed, samples not previously flagged are now correctly identified as


outliers. Swamping causes normal samples to appear as outliers: an outlier sample has a sufficient effect on an outlier measure to make a normal sample appear as an outlier by that measure. Several approaches attempt to rectify masking and swamping. One approach uses outlier measures robust to these effects; for example, the classical mean and covariance matrix in the MD can be replaced with robust estimates,7,16 or the minimum covariance determinant or the minimum volume ellipsoid can be used.6,17 Another strategy computes an outlier measure over a CV process using multiple sub-sample sets of the full calibration set.10-12,14 However, these methods are tuning parameter dependent and usually require a statistical threshold. A fusion process is used in this paper to counter masking and swamping due to multiple outliers.
The method of sum of ranking differences (SRD) is a generic fusion process for simultaneous evaluation and comparison of multiple variables over a collection of objects.18-20 It was recently used for automatic tuning parameter selection by fusion of multiple model quality measures.21 Presented in this paper is the application of SRD as an objective fusion process that ensembles multiple x- and y-outlier merits over tuning parameter windows for simultaneous evaluation in order to rank a sample by the degree to which it fits the calibration domain. With fusion, there is no limit to the number of measures that can be used. In addition to common outlier measures, the SRD process presented in this paper uses several dissimilarity measures from the literature. Also used are atypical measures based on Procrustes analysis22,23 and extended inverted signal correction (EISC),24,25 a reversal of the commonly used extended multiplicative signal correction (EMSC).26,27 Another novel component of using SRD for outlier detection is the flexibility to confirm a suspect sample as an outlier. After SRD identifies a sample as an outlier, an SRD fusion verification process can be used. Additionally, a variation of the SRD input data can be used to check for masked samples. These SRD evaluations allow for a more resilient outlier detection process since each of the three SRD fusions has a unique viewpoint on the data.
To demonstrate the potential of SRD for outlier detection, two near infrared (NIR) data sets are investigated. Both have multiple analyte prediction properties, allowing SRD to be tested with x- and y-outlier measures singularly and in combination. One data set is spectra of corn samples28 and the other is of biscuit (cookie) dough.29
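As an illustration of the tuning-parameter-window idea described above, here is a minimal sketch of MDs computed over eigenvector windows, assuming an SVD-based MD on the mean-centered remaining samples (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def md_window(X_rest, x_i, k_max):
    """Mahalanobis distances of x_i to X_rest over eigenvector windows 1..k_max.

    Illustrative sketch: MD_k uses the first k right singular vectors of the
    mean-centered remaining samples, so one call yields the whole window.
    """
    n = X_rest.shape[0]
    xbar = X_rest.mean(axis=0)
    _, s, Vt = np.linalg.svd(X_rest - xbar, full_matrices=False)
    scores = Vt @ (x_i - xbar)            # projections onto each eigenvector
    terms = (n - 1) * scores**2 / s**2    # per-eigenvector MD^2 contributions
    return np.sqrt(np.cumsum(terms[:k_max]))
```

Each returned entry is the MD at one window size; the whole collection, rather than a single tuned value, would feed a fusion process.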

OUTLIER MEASURES
Spectral Measures (x-Outlier). To assess if a sample spectrum in a set of n samples is an outlier, the sample is removed and a mathematical measure of similarity (or dissimilarity) is used to compare the removed sample spectrum xi to the remaining sample spectra Xn-1. Several outlier measures are possible. The first series of measures uses the mean spectrum vector x̄ of Xn-1 to characterize the remaining sample space for comparison to xi. These measures are listed in Table S-1. Of these measures, the three most common are the cosine of the angle (cosθi) between the vectors (eq. S1, expressed for minimization), the Euclidean distance (eq. S2), and the determinant (eq. S3).30 If a sample is an outlier, the cosθi shape-difference measure will be small (0 ≤ cosθi ≤ 1, with a value of 0 for orthogonal vectors and 1 for collinear vectors). The Euclidean distance measures magnitude differences and is large for dissimilar samples. A large determinant denotes dissimilarity; the determinant measures the hyper-volume spanned by two vectors and simultaneously considers magnitude and angular (shape) relationships. The spatial inner product correlation (eq. S4, expressed for minimization) characterizes the correlation between the two outer product matrices Xi (Xi = xi xiᵀ) and X̄ (X̄ = x̄ x̄ᵀ). Smaller values indicate dissimilarity between xi and x̄.31 The outer product has been found useful in other correlation studies.32,33
The Procrustes analysis (PA) process transforms a data set to resemble another data set.22,23 For example, unconstrained PA is commonly used in calibration transfer, where a transformation matrix T is sought that maps a set of spectra measured on one instrument, X2, to appear as if the spectra were measured on another instrument, X1, for which there is a preexisting calibration model. The transformation matrix obtained by solving the equation X1 = X2T represents the amount of rotation and dilation needed to make X2 resemble X1.34 Translation is another PA step, accomplished by a priori mean centering the respective data sets. In PA, the effectiveness of T is determined by how well X2 is transformed to resemble X1, assessed by the sum of the squares of the differences between X1 and the transformed set X2T (also known as the square of the Frobenius norm of the difference matrix, ||X1 − X2T||F²). For outlier detection, unconstrained PA is used in a novel way that has not been explored before. In this case, the sample outer product Xi is transformed by Ti to the mean sample outer product X̄ by X̄ = XiTi (eq. set S5). Instead of assessing the transformation effectiveness, the degree of difficulty (amount of work) needed is determined for outlier detection. The more difficult it is to make the transformation, the greater the amount of rotation and dilation needed and the more likely the sample is an outlier. This information is contained in Ti, but in order to quantify it, a reference point (base) transformation matrix is needed. The base T used is the transformation matrix needed to self-transform X̄ to X̄ by X̄ = X̄T (eq. set S6). The respective transformation matrices are calculated as shown in eq. sets S5 and S6 using rank-one singular value decomposition (SVD) inverses or, more simply, due to the vector outer products,

Xi⁺ = (xi xiᵀ)⁺ = xi xiᵀ / ||xi||⁴, with a similar equation for X̄⁺. With both transformation matrices computed, the Frobenius norm of the difference between Ti and T is obtained (eq. S7); the greater the norm, the more dissimilar xi is to x̄. In order to maximize the degree of difficulty in making the transformation, the translation step was omitted (each outer product matrix was not mean centered to itself).34 The difference between respective translation means could be used as another outlier measure, but it was not.
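The rank-one shortcut and the resulting "amount of work" measure can be sketched numerically; a minimal illustration, assuming the least-squares solution Ti = Xi⁺X̄ for eq. set S5 (function names are illustrative, not the authors' code):

```python
import numpy as np

def pa_difficulty(x_i, xbar):
    """Unconstrained-PA 'work' measure: ||T_i - T_base||_F.

    T_i maps the outer product X_i = x_i x_i^T toward Xbar = xbar xbar^T;
    the rank-one pseudoinverse X_i^+ = x_i x_i^T / ||x_i||^4 is used.
    """
    Xi = np.outer(x_i, x_i)
    Xb = np.outer(xbar, xbar)
    Ti = (Xi / np.linalg.norm(x_i) ** 4) @ Xb   # solves Xbar ~ Xi @ Ti
    Tb = (Xb / np.linalg.norm(xbar) ** 4) @ Xb  # base self-transformation
    return np.linalg.norm(Ti - Tb)
```

A sample identical to the mean requires no extra work (difficulty 0); the further xi drifts from x̄, the larger the Frobenius-norm difference.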


The other type of PA is constrained PA, where rotation and dilation are separated into an orthogonally constrained rotation matrix Hi and a dilation constant ρi (eq. set S8) when transforming Xi to X̄ by X̄ = ρi Xi Hi. Similar to unconstrained PA, baseline operators H and ρ are obtained for self-transformation of X̄ by X̄ = ρ X̄ H (eq. set S9). Rather than determining the quality of the transformation, the Frobenius norm of the difference between Hi and H and the absolute value of the difference between ρi and ρ are used to measure the extent of work required to carry out the transformation of Xi to X̄ (eq. set S10). Due to the vector outer products, simplifications are possible relative to the respective vectors, but only the general equations are presented. As with unconstrained PA, the translation step was left out.
Inverted multiplicative signal correction (ISC)24 is the inverted form of the commonly used multiplicative scatter correction (MSC),26 and the same statement can be made for extended ISC (EISC)25 relative to EMSC.27 These spectral preprocessing methods are used to correct for light scattering and other non-analyte signal-forming components in a spectrum. The fundamental operation is similar to PA: correct a spectrum to be similar to a target spectrum as a function of multiple parameters. Used in this paper is a variant of EISC that focuses the transformation on the spectral differences35,36 between xi and x̄ (d = x̄ − xi), referred to as EISC differences (EISCD). A multitude of correction terms are possible, but only those listed in eq. set S11 were used. Analogous to PA, rather than using corrected spectra, outlier detection is based on the difficulty of the correction. The correction vector b is calculated and the two outlier measures in eq. set S12 are used: the Euclidean norm of the correction vector b and the norm of the amount of correction actually applied, Xcb. These two measures are also calculated by interchanging xi with x̄, for four outlier measures in total.
The greater the values, the more likely the sample removed is an outlier. This approach to EISC (or the other signal correction methods)24-27,35,36 has not been utilized before.
The second series of outlier measures is based on comparing the removed sample spectrum xi to the full calibration space of the remaining samples Xn-1 by spanning this space with eigenvectors from the SVD of Xn-1 (or another decomposition). The measures used in this study are described next and compiled in Table S-2. A common outlier detection measure in this situation is the Mahalanobis distance (MD), which determines the distance between the sample removed and the centroid of Xn-1 (eq. set S13).1,37 The generalized inverse of the mean-centered covariance matrix C̃ is calculated using the SVD by C̃⁺ = Vk Sk⁻¹ Ukᵀ, where k represents the tuning parameter value for the number of respective eigenvectors and singular values retained. How a value for k is determined is not standardized in applications of MD. The value used can make a sample an outlier or not depending on the spectral relationships captured by the tuning parameter relative to the sample being evaluated. The approach taken here is novel: instead of selecting a value for k, a range is used to form a collection of MDs and all the

MDs are used in the outlier detection process. To determine the window size, a simple rule is used: eigenvectors and singular values are included starting with the first and ending when approximately 99.9% of the Xn-1 information determined from the SVD is captured and the corresponding eigenvectors are not excessively composed of noise structure. A consensus from the MD measure is obtained by using a tuning parameter window. Leverage and Hotelling's t-squared values are closely related to MD, and therefore only MD is used.
Another standard spectral outlier measure in this series is the Q-residual. The Q-residual is an assessment of the extent to which a sample is orthogonal to the calibration space.1,37 The residual of the removed sample spectrum, xi⊥, is calculated from the orthogonal projection operation shown in eq. set S14. A series of Q-residual values is used over the same tuning parameter window used for MD. Related to the Q-residual is the angle between the removed sample spectrum and the calibration space. The sine of this angle (sinβ) has been shown to be useful as a classification measure (eq. S15).38 Thus, sinβ is used as an outlier measure to further enhance the consensus among all the outlier measures. The last spatial merit is the divergence criterion (DC).39 The DC measures the difference between two outer product spaces and the respective generalized inverses. In this case, one outer product is for the removed sample spectrum and the other is for the remaining samples (eq. set S16). The generalized inverse for this outer product is based on the corresponding SVD. Instead of the usual selection of a single tuning parameter value for the inverse, a window is used for a collection of inverses and hence DC values.
Analyte Measures (y-Outlier). Since a purpose of cleaning the calibration space is to form a reliable prediction model, merits that include prediction residuals (errors) are important.
In this study, partial least squares (PLS) is used to form all prediction models.1 Two y-outlier measures are described next and listed in Table S-3. The externally Studentized residual1,40 (t) is a more robust version of the usual sample prediction residual y − ŷ37,40 and is used here in absolute form (eq. set S17). The usual approach with this y-outlier measure using PLS is to select a value of k representing the PLS tuning parameter for the number of latent variables (LVs) forming the corresponding regression vector and sample prediction ŷ. In this study, a range of LVs is used. The window is determined by performing a leave-multiple-out CV (LMOCV) on the sample set to be cleaned, and several LVs before and after the minimum are used. This can be determined graphically by the user or in an automated mode. Using the corn data for moisture (described in the Experimental section) as an example, Figure S-1 contains a plot of the root mean square error of cross-validation (RMSECV) against the L2 norm of the PLS regression vectors for 300 splits with 40% of the samples randomly removed for validation on each split. Any LV window size can be selected, but distinctly under- and overfitted models are best avoided. Values in the plot shown in Figure S-1 form the moisture LV window. For automatic selection, an odd total number of


LVs needs to be chosen, e.g., 7 LVs, with the three LVs before and the three after the minimum in RMSECV used for the outlier measures.
The other y-outlier merit stems from a new measure that assesses how well the chemical and physical matrix of the removed sample matches that of the remaining samples.41 The premise is that if samples are matrix matched, then the interaction of a regression vector b with the respective sample spectra should be similar. The interaction behavior is captured by plotting the absolute value of the prediction errors against l scalar values of α that adjust the magnitude of the selected PLS regression vector b used to form the respective prediction errors; essentially, plotting |yj − αl ŷj| against αl for the jth sample. This plot forms a V-shaped curve with zero prediction error at the bottom. The more matrix matched the samples, the more similar the α values at the bottoms of the V-curves. Respective V-curve shapes will also be similar for matrix matched samples, but this component was not used in this study.41 Eq. set S18 shows the respective calculations for computing, at a particular PLS LV model, the α value at zero prediction error for each sample (min. αj), the mean min. αj value for the remaining samples, and the difference between the min. αi value for the sample removed and the mean value of the remaining samples. The more the removed sample is an outlier (not matrix matched), the greater the difference between its min. αi and the mean min. αj value of the remaining samples. Instead of selecting an LV, the same LV window used for the Studentized residuals is used.
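Since zero prediction error occurs where yj − α ŷj = 0, the bottom of each V-curve is simply αj = yj/ŷj; a minimal sketch of the resulting difference measure, with illustrative names (not the authors' code):

```python
import numpy as np

def min_alpha_outlier(y, y_hat, i):
    """Difference between sample i's zero-error alpha and the mean of the rest.

    min alpha_j solves y_j - alpha * y_hat_j = 0, i.e. alpha_j = y_j / y_hat_j.
    A large difference suggests sample i is not matrix matched.
    """
    alphas = y / y_hat
    return abs(alphas[i] - np.delete(alphas, i).mean())
```

Computed over each LV in the window, these differences join the Studentized residuals as rows of the SRD input matrix.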

OUTLIER DETECTION AND VERIFICATION
SRD Detection. The sum of ranking differences (SRD) method is generic and can be used to compare the similarity (or dissimilarity) of any set of variables (columns of the SRD input matrix) across any set of objects (rows).18-20 The comparison is relative to targets for each object, which can be specific values or generic, such as the minimum. The result of SRD is a ranking of the variables by how well they meet the target; the lower the rank, the better that variable meets the respective object target values. For example, SRD was used to compare models (essentially tuning parameters) for model selection. In that use, the variable columns of the SRD input matrix were PLS models at different LVs, with a collection of measures of model quality as the object rows.21 It was similarly used to select multiple tuning parameters for penalty-based regression methods.42 The actual SRD computational components to rank the columns are well explained18-21,42 and MATLAB and Excel code are available.20,43
For outlier detection, all outlier measures are rows of the SRD input matrix and columns correspond to the respective removed sample being evaluated as a potential outlier. Thus, the SRD input matrix is m × n for m outlier measures and n samples. Since numerous measures are assessed in the SRD input matrix, all values are row-wise scaled to unit length. Merits included in the SRD input matrix can be spectral x-outlier measures, analyte prediction y-outlier measures, or both. All three situations are studied in this paper. To detect outliers, the SRD target value is set to the maximum, and hence SRD ranks all the samples such that the removed sample (column) with the lowest rank is the sample most consistently dissimilar across all the outlier measures relative to the remaining samples. In other words, the lower the sample ranking, the more dissimilar the sample. This use of SRD is reversed from the usual one, where the lower the SRD ranking, the more similar the variable is to the target.
Available with SRD is the comparison of ranks by random numbers (CRRN) based on a normal distribution of rankings when the number of rows is greater than 13, as is the situation in this study.19 The CRRN allows determining the probability that an SRD ranking is no different from a random ranking. To determine if a sample is an outlier, a threshold is used on the number of standard deviation (σ) units (and hence confidence level) from the mean of the random ranking distribution in the smaller ranking direction. While any sample with a ranking smaller than the ranking at the σ threshold could be labeled an outlier, only the lowest ranked sample beyond the σ threshold is deemed an outlier and removed, after the SRD verification process confirms the sample is an outlier. This avoids masking and swamping effects. In general, a threshold of 3σ is recommended, but another can be used as described in the Results and Discussion section. The 3σ cutoff helps ensure any sample ranked lower has a very small probability of having a random rank. Proximity of SRD rankings reveals which samples are similar to each other, and therefore sample groupings are possible. Because sample groupings can be due to masking, samples are sequentially removed. The overall approach presented in this paper is sequential: the most extreme SRD-ranked sample beyond the σ threshold is found, checked by two variations of SRD (as described in the next sections), and then removed. This process continues until no samples are ranked lower than the σ threshold.
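The core SRD ranking can be sketched in a few lines; a minimal illustration assuming the row-wise maximum as the target (as used here for outlier detection) — the function name and tie handling are illustrative, not the authors' MATLAB implementation:

```python
import numpy as np

def srd(M, target=None):
    """Sum of ranking differences for an m x n input matrix M.

    Rows are outlier measures (objects), columns are removed samples
    (variables). Default target 'maximum' uses the row-wise maximum.
    """
    ref = M.max(axis=1) if target is None else target
    ref_rank = np.argsort(np.argsort(ref))                # ranks of target values
    col_rank = np.argsort(np.argsort(M, axis=0), axis=0)  # ranks within each column
    return np.abs(col_rank - ref_rank[:, None]).sum(axis=0)
```

A column whose values are consistently the most extreme across all measures scores lowest, i.e., is the strongest outlier candidate.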
To further reduce the chance of a random rank, an internal SRD CV (SRDCV) is used, consisting of a random LMOCV on the rows of the SRD input matrix. Because tuning parameter windows are used, the merits in Table S-1 were considered as one block and each of the individual merits in Tables S-2 and S-3 were respective blocks for the corresponding tuning parameter windows. The LMOCV was performed on each block 50 times with 15% of each merit block randomly removed before computing SRD ranks. The number of splits and percentage left out are user defined, but 50 splits and 15% were found to be satisfactory. The goal of the LMOCV is to obtain a boxplot of the SRD rankings for visual inspection to assess the rank consistency of the samples removed. Such boxplots are presented in the Results and Discussion section.
SRD Outlier Verification. The SRD approach can be used to confirm that a removed sample identified as an outlier is indeed an outlier before permanent removal. The verification procedure is similar to cleaning the sample (calibration) set except that the suspect sample is first removed from the sample set. Outlier measures are then calculated for the suspect sample relative to the remaining sample set. Values for the outlier measures are also obtained using the sample removal procedure on the remaining sample set with the suspect sample removed. Next, an SRD is performed, and if the suspect sample is an outlier, its rank will stay the same or decrease and stay below the σ threshold. If it is not ranked below the σ threshold, then swamping or masking could be occurring, and masking can be checked by the SRD procedure described in the next section.
SRD Masking Check. The SRD method can be used to identify whether any samples similar to an outlier sample (or suspect sample) are present in a sample set. The outlier measures computed for the verification process are used here. Instead of using the maximum as the SRD ranking objective, the actual outlier merit values for the outlier sample are used. With this target, the samples ranked lowest are those with outlier measure values closest to the outlier sample. If a masking sample is present, it will rank below the σ threshold.
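A minimal sketch of this masking-check target, assuming a rank-difference score against the flagged sample's own merit values (names are illustrative, not the authors' code):

```python
import numpy as np

def masking_check(M, outlier_idx):
    """Score samples by closeness of their merit ranks to the flagged outlier's.

    Target = the outlier sample's own merit values; low scores flag potential
    masking partners (the outlier itself scores 0).
    """
    target_rank = np.argsort(np.argsort(M[:, outlier_idx]))
    col_rank = np.argsort(np.argsort(M, axis=0), axis=0)
    return np.abs(col_rank - target_rank[:, None]).sum(axis=0)
```

Any other sample scoring near zero mimics the outlier across the merits and is a candidate masking sample.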

EXPERIMENTAL
Software. MATLAB code written by the authors was used for all outlier merit calculations. The MATLAB and Excel SRD code is downloadable.20,43 The outlier code is available from the authors.
Data Sets. Corn. The corn data consist of the same 80 corn samples measured from 1100 to 2498 nm at 4 nm intervals for 350 wavelengths on three near-infrared (NIR) spectrometers labeled m5, mp5, and mp6. Analyte values provided are moisture, oil, protein, and starch.28 The spectra plotted in Figure S-2 are measured on mp6 and used to demonstrate the SRD detection process. All four analyte properties are studied as the y-outlier values change per property. Spectra measured on m5 are also plotted in Figure S-2, and these samples are used in the SRD masking study. For the x-outlier measures, the eigenvector window from 1-15 is used. Specifically, the measures are computed with eigenvector 1, then eigenvectors 1-2, and so on up to the window of eigenvectors 1-15. All 15 windows are used in SRD. For the y-outlier measures, the PLS LV windows are prediction property specific, with LVs 3-20, 3-17, 3-18, and 4-17 used for moisture, oil, protein, and starch, respectively. The PLS LV window was empirically determined as described in the previous y-outlier section using 300 random splits with 60% in the calibration set.
Biscuit. The second data set consists of 72 biscuit dough samples measured in the NIR from 1100 to 2498 nm in steps of 2 nm for 700 wavelengths.29,44 As with the original work, the region from 1380 to 2400 nm was used. The wavelength interval was increased to 4 nm for a final total of 256 wavelengths.29 Analyte values are provided for percent fat, sucrose, dry flour, and water. The data have a literature split with 40 samples in the calibration set. The 40-sample calibration set is plotted in Figure S-3 and studied to evaluate the effects of the SRD σ threshold. For the x-outlier measures, the eigenvector window from 1-9 is used.
For the y-outlier measures, the PLS LV windows were LVs 3-10, 2-9, 2-8, and 2-10 for fat, sucrose, dry flour, and water respectively.

RESULTS AND DISCUSSION
The general outlier detection approach presented in this paper is sequential: the most extreme sample beyond the SRD σ threshold is identified, verified, and removed. This continues until a cleaned data set is obtained that can then be used to form a calibration model.
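The σ threshold on the random ranking distribution can be illustrated numerically; a minimal sketch of the CRRN idea, using a Monte Carlo stand-in for the exact distribution (the sizes and the simulation itself are illustrative assumptions, not the published CRRN computation):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20                       # example number of outlier-measure rows
ref = np.arange(m)           # a fixed target ranking
# SRD values of purely random rankings approximate the random-rank distribution
srd_rand = np.array([np.abs(rng.permutation(m) - ref).sum()
                     for _ in range(20_000)])
mu, sigma = srd_rand.mean(), srd_rand.std()
threshold = mu - 3 * sigma   # samples ranked below this are unlikely to be random
```

Samples whose SRD rank falls below `threshold` are the candidates flagged, verified, and removed in the sequential procedure.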

For comparison, presented first are results from using the two standard measures Mahalanobis distance (MD) and Q-residual on the corn data predicting moisture. The MD and Q-residual are usually computed at a specific number of eigenvectors (principal components) and often evaluated simultaneously by plotting one against the other, with empirical thresholds then set based on an assessment of the graph. Instead, the two measures are evaluated by SRD. Because a window of tuning parameters can be used with the SRD process, the MD and Q-residuals were evaluated by SRD over the same eigenvector window used with the multiple x-outlier measures. Following outlier detection by the two standard measures, outlier detection is performed on the corn data set for moisture using the collection of x-outlier measures, the y-outlier measures, and then the x- and y-outlier measures together. An example of verification and masking is then characterized. Completing the paper are studies on the effect of the number of multiple outliers and the σ threshold relative to the corn and biscuit data.
Corn Data. Standard x-Outlier Measures. Shown in Figure S-4a are images of the MD and Q-residual for the first 15 eigenvectors with the MD on top. This image constitutes the SRD input matrix, where column values correspond to MD and Q-residual values when that sample is out. The SRD normalized ranks are shown in Figure S-4b along with the random number distribution. The 3σ limit of the distribution is shown in red. From this plot, it is clear that using MD and Q-residual with SRD does not identify any outlying samples. The box plot in Figure S-4c was generated by performing the previously described SRDCV on the SRD input matrix. Again, no outliers are identified.
All x-Outlier Measures. By using multiple measures, the robustness of the detection process is enhanced since each merit has a unique perspective on the data.
Shown in Figure 1a is the SRD input matrix using all the x-outlier measures, in the order of Tables S-1 and S-2, for the corn data. The SRD results in Figures 1b and 1c clearly show samples 75 and 77 as x-outliers. These samples are highlighted in Figure S-5 and are not obvious full-spectrum outliers. The MD and Q-residuals were most likely unable to detect these two outliers because samples 75 and 77 mask each other; using multiple outlier measures allows detection. After sample 75 is removed and the SRD process is repeated, sample 77 is again identified as an outlier below the 3σ threshold. After removing samples 75 and 77, no other samples were identified as x-outliers. As a reminder, SRD is being used here in a non-standard mode: normally, SRD normalized rankings such as those in Figures 1b and 1c would be used to indicate objects similar to the target, ranking to the left of the random-ranking mean value.

Figure 1. Corn data: (left) SRD input matrix with unit length normalized rows of all x-outlier measures, (center) SRD normalized ranks with the random ranking distribution and 3σ limit (red line), and (right) SRDCV box plots. Numbers in the center are the respective samples out.

y-Outlier Measures. Plotted in Figure S-6a is the SRD input matrix for predicting moisture using the two window-based y-outlier measures in the order of Table S-3. The SRD normalized ranking results shown in Figures S-6b and S-6c clearly reveal two outliers, with samples 71 and 1 ranking below the 3σ threshold. After removing sample 71 and running SRD again, sample 1 and the remaining samples are all above 3σ; hence, sample 71 was probably swamping sample 1. Note that while samples 75 and 77 are x-outliers, these samples are not y-outliers for moisture.

The other prediction properties were y-outlier checked, with Table 1 listing the samples removed sequentially. Sample 77 is found to be both a y-outlier and an x-outlier for oil and protein, and sample 75 is both a y-outlier and an x-outlier for starch.

Table 1. Samples Sequentially Removed for Corn Prediction Properties Using y-Outlier Measures

Iteration   Moisture   Oil   Protein   Starch
1           71         77    77        75
2           —          27    2         3
3           —          7     11        —

If it is plausible to re-measure y-outlier samples for the reference value, then this can be performed and a decision made based on the re-evaluation. If re-measurement is not possible, then the reference value may be suspect and the sample should be removed, assuming the reference values of the calibration set are well distributed. The sample may also not be matrix matched relative to the matrix matching measure (eq set S18 in Table S-3).

x- and y-Outlier Measures Simultaneously. In addition to using the x- and y-outlier measures independently, the measures can be evaluated simultaneously with SRD. In this case, because of the larger number of x-outlier measures compared to the number of y-outlier measures, samples 75 and 77 are first identified as outliers regardless of the prediction property. Listed in Table 2 are the samples removed for each prediction property. The same y-outlier samples are identified for oil, protein, and starch. For moisture, after removal of samples 75, 77, and 71, two additional samples are now identified as outliers.

Table 2. Samples Sequentially Removed for Corn Prediction Properties Using x- and y-Outlier Measures Simultaneously

Iteration   Moisture   Oil   Protein   Starch
1           75         75    75        75
2           77         77    77        77
3           71         27    2         3
4           72         7     11        —
5           80         —     —         —

Plotted in Figure 2 is the moisture RMSECV after removing the two x-outliers, the one moisture y-outlier, the two x- and one y-outlier together, and lastly all five x- and y-outliers. From the plot, it appears that simultaneously using x- and y-outlier measures reduces the RMSECV more than applying only the x-outlier measures. As expected, the largest RMSECV decreases result from removing samples deemed y-outliers. Thus, the suggested sequence is to first remove all samples identified as y-outliers only. Following removal of the y-outliers, SRD is used to evaluate the x- and y-outlier measures together. Removing y-outliers first avoids the potential dominance of the larger number of x-outlier measures. It should be noted that while the RMSECV values are reduced by removal of the samples, there is a sacrifice to representativity in the covariance structure of the calibration samples. This issue is further discussed in the Number of Multiple Outliers section.

Figure 2. Corn moisture RMSECV plotted against the L2 regression vector norm for different numbers of samples removed: full calibration set (blue solid), removal of x-outlier samples 75 and 77 (brown x), y-outlier sample 71 (gold square), x- and y-outlier samples 75, 77, and 71 (purple circle), and x- and y-outlier samples 75, 77, 71, 72, and 80 (green dash) identified with both sets of measures.

Outlier Verification and Checking for Masking. The flexibility of SRD allows identification, verification, and checking for masking using the x- and y-outlier measures independently and/or in combination. Presented here is only the use of the x-outlier measures in Tables S-1 and S-2. Described first is an example of verification.

For the corn data, samples 75 and 77 are identified as x-outliers. As described previously, verification proceeds by removing only sample 75 (lowest rank beyond 3σ) from the sample set. The usual SRD sample-removal process is then used to compute outlier measures for the respective samples out relative to the remaining samples. Outlier measures are also computed for sample 75 relative to the remaining samples. An SRD input matrix is assembled and the columns (samples) are ranked. Shown in Figure S-7a are the results of this analysis. Sample 77 is now the lowest ranked sample because when it is the sample removed, sample 75 is no longer in the remaining set to mask it. Because sample 75 is still beyond the 3σ threshold, outlier confirmation is obtained.

After verifying sample 75 as an outlier, it can be checked for masking. Specifically, the SRD process is used to ascertain whether any of the remaining samples are similar to an identified outlier. As described previously, this is accomplished by using the verification SRD input matrix; the only change is that instead of an SRD maximum target, the outlier values for the identified outlier (sample 75 in this case) are used as the target. Shown in Figure S-7b are the results. As expected, sample 77 is clearly identified as best matching the outlier values for sample 75. Note that performing a masking check is not restricted to samples identified as outliers; a sample can be screened for masking at any time.

Number of Multiple Outliers. Since masking is a large concern in multivariate calibration, especially with sizeable data sets, it is important to evaluate an outlier detection process for the number of multiple outliers of the same kind that can be identified. After a certain point, the number of outliers becomes too great and the sample space is disrupted beyond the uniqueness of the normal samples.
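The masking check described above amounts to re-running SRD with the identified outlier's own measure values as the target instead of the row-wise maximum: the remaining sample ranking closest to that target is the prime masking suspect. A minimal sketch under that reading, using hypothetical measure values rather than the corn data:

```python
import numpy as np
from scipy.stats import rankdata

def srd_to_target(X, target):
    """Unnormalized SRD of each column of X against an arbitrary
    target column (here, a verified outlier's measure profile)."""
    ref = rankdata(target)
    cols = np.apply_along_axis(rankdata, 0, np.asarray(X, dtype=float))
    return np.abs(cols - ref[:, None]).sum(axis=0)

# Hypothetical outlier measures: rows are measures, columns are samples.
# Column 3 is a verified outlier; column 4 has a nearly identical profile.
X = np.array([[0.1, 0.2, 0.1, 5.0, 4.9],
              [0.3, 0.2, 0.2, 7.0, 6.8],
              [0.1, 0.1, 0.3, 6.0, 6.1]])
outlier = 3
scores = srd_to_target(np.delete(X, outlier, axis=1), X[:, outlier])
# The remaining sample with the lowest SRD best matches the outlier's
# profile and is the likely masking partner (here, original column 4).
suspect = int(np.argmin(scores))
```

This mirrors the corn-data example, where sample 77 is the closest match to the outlier values of sample 75.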
To examine multiple outliers, samples measured on instrument mp6 (minus samples 75 and 77) were used as the calibration set. Samples measured on instrument m5 acted as outliers; they were randomly selected and added to the mp6 set. After a set of m5 samples was added, the combined sample set went through the outlier cleaning process using only the x-outlier measures (the y-outlier measures could also be used) until all samples below the 3σ threshold were removed. This process of adding samples was repeated 10 times, with a different random set of selected samples for each repetition. Within each trial, the m5 set size increased sequentially, e.g., when 10 samples were added, the previous nine were the same. It was found that up to nine m5 samples could be added with successful identification of all nine outliers in each of the 10 attempts. When 10 samples were added, SRD was unable to identify all 10 m5 samples in only one of the 10 cases. In that situation, SRD identified only three outliers (true positives), and the other seven m5 samples were false negatives, being identified as normal (mp6). There were no false positives, with no mp6 samples identified as outliers. This particular case exemplifies the difficulty with multiple outliers: when nine m5 samples were added, all nine were identified, yet adding just one more sample to these same nine caused the identification difficulty. Identifying only three of the 10 m5 samples was probably due to a unique masking situation. As noted previously, one could sequentially screen each sample for potential masking either before or after any SRD cleaning.

When 15 samples were added using the 3σ threshold, six of the 10 cases successfully identified all 15 m5 samples. In the other four situations, there were no false positives and, as with the one case of 10 added m5 samples, only a few of the added m5 samples were correctly identified (true positives). The σ threshold was then decreased to 2σ and applied to the same series of 10 random m5 sample sets. As expected, due to the high density of mp6 samples, false positives appeared in all 10 cases of adding only one m5 sample. With the smaller 2σ threshold, up to 15 m5 samples could be added with all 15 now identified as outliers; the sacrifice is identifying a few mp6 samples as outliers (false positives). Thus, for the corn data, too large a σ threshold can result in fewer multiple outliers being identified, while with too small a σ threshold, more multiple outliers can be identified at the risk of removing a few normal (inlier) samples.

The purpose of this work is to describe a process to better determine when "unique" samples are present. These samples could be outliers or extreme samples.45 A recent paper discusses the dilemma of whether to remove identified outliers.46 A discussion of assessing outliers versus extreme samples, including removal options, is presented in the SI.

Biscuit Data. The effect of the σ threshold was studied for the biscuit calibration set of 40 samples using the cleaning procedure recommended in the Corn Data section, i.e., first checking for y-outliers only, followed by analysis using the x- and y-outlier measures simultaneously. This data set is smaller than the corn data, and its score plot indicates a great deal of sparseness compared to the high density of the corn data.
Thus, the σ threshold could have a greater effect on the number of outliers identified than for the corn data. Results listed in Table 3 are from using σ thresholds of 3, 4, and 5. Due to the small number of samples and the sparse score plot, it appears that a larger σ threshold is more applicable, because too many non-outlier samples may be removed with a smaller σ value. Similar to the corn data situation of adding samples from another instrument to the calibration set, too small a σ threshold removes too many samples, some of which are not outliers.

Table 3. Number of Samples Sequentially Removed for Biscuit Prediction Properties Using y-Outlier Measures First Followed by Simultaneous x- and y-Outlier Measures at Different σ Thresholds

σ Threshold   Fat(a)   Sucrose(a)   Dry Flour(a)   Water(a)
3             0, 6     2, 4         2, 3           3, 2
4             0, 3     2, 4         2, 3           1, 1
5             0, 0     2, 0         0, 1           1, 0

(a) The first value is for only the y-outlier measures and the second value is for the x- and y-outlier measures.


CONCLUSIONS

Presented is a new SRD fusion-based outlier detection method capable of performing outlier cleaning, verification, and checking for masking. Any number of outlier measures can be simultaneously evaluated for a consensus analysis. Concurrent assessment of multiple merits is vital because single merits can provide misleading and conflicting results. Unlike other outlier measures, the presented method uses windows of tuning parameters, thereby avoiding the need to select a tuning parameter for each particular outlier measure. The methods of PA and EISC were used in novel ways to form new outlier measures. The recommended SRD protocol is to first remove all samples identified as y-outliers only; after removal of these samples, SRD is used to evaluate the x- and y-outlier measures together.

In addition to the outlier measures used in this study, others can be added to further enhance the SRD consensus. For example, PLS was used to form the y-outlier measures, but other regression methods could also have been used and augmented to the PLS results in the SRD evaluation. Another measure not used is the difference in the translation operation for both forms of PA. For the EISCD, the degree of difficulty was assessed bidirectionally, but in only one direction for PA; PA could also be expanded to be bidirectional. Several ways to calculate the MD are also possible3,47 and could be added. An approach not used in this paper that should significantly enhance outlier detection is to augment the SRD input matrix of full-spectrum x- and y-outlier measures with x- and y-outlier measures based on wavelength windows. A combination of window sizes and variations in the amount the windows slide across the spectra can be used, with each combination adding rows to the SRD input matrix. Also not used in this study, but possible, is a CV process on the sample set,10-12,14,27 where each data split would create another block of rows for the SRD input matrix. Such a process would further guard against masking and swamping.

The SRD fusion process can also be adapted for quality control, where new samples are assessed for predictability by a calibration model, i.e., are new samples outliers relative to the calibration samples? The three SRD operational modes (identification, verification, and masking check) can all be used for quality control.

REFERENCES

(1) Næs, T.; Isaksson, T.; Fearn, T.; Davies, T. A User Friendly Guide to Multivariate Calibration and Classification; NIR Publications: Chichester, UK, 2002.
(3) Penny, K. I.; Jolliffe, I. T. J. R. Stat. Soc. Ser. D (Stat.) 2001, 50, 295-307.
(4) Hawkins, D. M. Identification of Outliers; Chapman and Hall: London, 1980.
(5) Walczak, B.; Massart, D. L. Chemom. Intell. Lab. Syst. 1998, 41, 1-.
(6) Rousseeuw, P. J.; Leroy, A. M. Robust Regression and Outlier Detection; Wiley: Hoboken, NJ, 1987.
(7) Hubert, M. In Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, Vol. 3; Brown, S. D., Tauler, R., Walczak, B., Eds.; Elsevier: Amsterdam, 2009; pp 315-343.
(8) Rousseeuw, P. J. J. Am. Stat. Assoc. 1984, 79, 871-880.
(9) Pell, R. Chemom. Intell. Lab. Syst. 2000, 52, 87-104.
(10) Cao, D. S.; Liang, Y. Z.; Xu, Q. S.; Li, H. D.; Chen, X. J. Comput. Chem. 2009, 31, 592-602.
(11) Zhang, L.; Wang, D.; Gao, R.; Li, P.; Zhang, W.; Mao, J.; Yu, L.; Ding, X.; Zhang, Q. Chemom. Intell. Lab. Syst. 2016, 151, 89-94.
(12) Roberts, S.; Martin, M.; Zheng, L. Technometrics 2015, 57, 408-417.
(13) Tóth, Z.; Bodai, Z.; Héberger, K. J. Comput.-Aided Mol. Des. 2013, 27, 837-844.
(14) Li, H. D.; Liang, Y. Z.; Xu, Q. S.; Cao, D. S. Trends Anal. Chem. 2012, 38, 154-162.
(15) Becker, C.; Gather, U. J. Am. Stat. Assoc. 1999, 94, 947-955.
(16) Rousseeuw, P. J.; van Zomeren, B. C. J. Am. Stat. Assoc. 1990, 85, 633-651.
(17) Egan, W. J.; Morgan, S. L. Anal. Chem. 1998, 70, 2372-2379.
(18) Héberger, K. TrAC, Trends Anal. Chem. 2010, 29, 101-109.
(19) Héberger, K.; Kollár-Hunek, K. J. Chemom. 2011, 25, 151-158.
(20) Kollár-Hunek, K.; Héberger, K. Chemom. Intell. Lab. Syst. 2013, 127, 139-146.
(21) Kalivas, J. H.; Héberger, K.; Andries, E. Anal. Chim. Acta 2015, 869, 21-33.
(22) Andrade, J. M.; Gómez-Carracedo, M. P.; Krzanowski, W.; Kubista, M. Chemom. Intell. Lab. Syst. 2004, 72, 123-132.
(23) Kalivas, J. H. J. Chemom. 2008, 22, 227-234.
(24) Helland, I. S.; Næs, T.; Isaksson, T. Chemom. Intell. Lab. Syst. 1995, 29, 233-241.
(25) Pedersen, D. K.; Martens, H.; Nielsen, J. P.; Engelsen, S. B. Appl. Spectrosc. 2002, 56, 1206-1214.
(26) Geladi, P.; MacDougall, D.; Martens, H. Appl. Spectrosc. 1985, 39, 491-500.
(27) Martens, H.; Stark, E. J. Pharm. Biomed. Anal. 1991, 9, 625-635.
(28) Wise, B. M.; Gallagher, N. B. Eigenvector Research, Manson, Washington.
(29) Osborne, B. G.; Fearn, T.; Miller, A. R.; Douglas, S. J. Sci. Food Agric. 1984, 35, 99-105.
(30) Sánchez, F. C.; Toft, J.; van den Bogaert, B.; Massart, D. L. Anal. Chem. 1996, 68, 79-85.
(31) Ramsay, J. O.; ten Berge, J.; Styan, G. P. H. Psychometrika 1984, 49, 403-423.
(32) Barros, A. S.; Safar, M.; Devaux, M. F.; Robert, P.; Bertrand, D.; Rutledge, D. N. Appl. Spectrosc. 1997, 51, 1384-1393.
(33) Sigman, M. E.; Williams, M. R.; Ivy, R. G. Anal. Chem. 2007, 79, 3462-3468.
(34) Anderson, C. E.; Kalivas, J. H. Appl. Spectrosc. 1999, 53, 1268-1276.
(35) Berns, R. S.; Petersen, K. H. Color Res. Appl. 1988, 13, 243-256.
(36) Ottaway, J.; Kalivas, J. H. Appl. Spectrosc. 2015, 69, 407-416.

(37) Ferré, J. In Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, Vol. 3; Brown, S. D., Tauler, R., Walczak, B., Eds.; Elsevier: Amsterdam, 2009; pp 33-89.
(38) Higgins, K.; Kalivas, J. H.; Andries, E. J. Chemom. 2012, 26, 66-75.
(39) Goodenough, D. G.; Narendra, P. M.; O'Neill, K. Can. J. Remote Sens. 1978, 4, 143-148.
(40) Cook, R. D.; Weisberg, S. Residuals and Influence in Regression; Chapman and Hall: London, 1995.
(41) Emerson, R.; Kalivas, J. H. In preparation for submission, 2017.
(42) Tencate, A. J.; Kalivas, J. H.; White, A. J. Anal. Chim. Acta 2016, 921, 28-37.
(43) http://www.isu.edu/chem/people/faculty/kalijohn/, accessed January 2017.
(44) http://www.models.life.ku.dk/datasets, accessed January 2017.
(45) Pomerantsev, A. L.; Rodionova, O. Y. J. Chemom. 2014, 28, 429-438.
(46) Fearn, T. NIR News 2016, 27, 25.
(47) Todeschini, R.; Ballabio, D.; Consonni, V.; Sahigara, F.; Filzmoser, P. Anal. Chim. Acta 2013, 787, 1-9.

ASSOCIATED CONTENT
Supporting Information. Tables S-1 through S-3 and Figures S-1 through S-7, as referenced in the text.

AUTHOR INFORMATION
Corresponding Author
*Email: [email protected]

ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grant No. CHE-1111053 (co-funded by MPS Chemistry and the OCI Venture Fund), which is gratefully acknowledged by the authors.


FOR TOC ONLY

[TOC graphic: SRD consensus plot of normalized SRD values for the corn samples with the 3σ limit marked; samples 75, 77, and 71 fall beyond the limit and are labeled as outliers.]