Anal. Chem. 1998, 70, 2372-2379

Outlier Detection in Multivariate Analytical Chemical Data William J. Egan and Stephen L. Morgan*

Department of Chemistry & Biochemistry, The University of South Carolina, Columbia, South Carolina 29208

The unreliability of multivariate outlier detection techniques such as Mahalanobis distance and hat matrix leverage has been known in the statistical community for well over a decade. However, only within the past few years has a serious effort been made to introduce robust methods for the detection of multivariate outliers into the chemical literature. Techniques such as the minimum volume ellipsoid (MVE), multivariate trimming (MVT), and M-estimators (e.g., PROP), and others similar to them, such as the minimum covariance determinant (MCD), rely upon algorithms that are difficult to program and may require significant processing times. While MCD and MVE have been shown to be statistically sound, we found MVT unreliable due to the method’s use of the Mahalanobis distance measure in its initial step. We examined the performance of MCD and MVT on selected data sets and in simulations and compared the results with two methods of our own devising. Both the proposed resampling by half-means method and the smallest half-volume method are simple to use, are conceptually clear, and provide results superior to MVT and the current best-performing technique, MCD. Either proposed method is recommended for the detection of multiple outliers in multivariate data.

Analytical chemists must be concerned with the reliability of their data. Outliers do not make for reliable data. The recognition of unrepresentative measurements and the statistically based justification for giving those observations less weight in decision making are critical for the validation of laboratory data. When analyzing univariate data, analytical chemists most often apply Dixon’s Q-test.1-3 Unfortunately, improvements in the capabilities of chemical instrumentation and computers in the past decade have dramatically increased the dimensionality of experimental data, and univariate techniques are not appropriate for multivariate data. For example, charge-coupled arrays may acquire spectra with intensities at over 1000 wavelengths, and gas chromatographic/mass spectrometric instruments produce megabyte-sized retention time/mass-to-charge ratio matrices. This dimensionality increase has made finding outliers in such complex data considerably more difficult, as there is no hope of efficiently examining such enormous multivariate data sets in their original forms.

In their excellent, encyclopedic reference on outliers, Barnett and Lewis4 defined an outlier as “an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data”. The apparent inconsistency can be caused by contamination from another distribution, i.e., another type of observation, or extreme observations truly belonging to the expected group(s) in the data set. Either way, these observations must be first identified and examined before the decision to use, modify, or discard them can be made. Given this rather broad definition of an outlier, what measures best detect outliers and when should these measures be employed? The more commonly used measures, such as the mean, standard deviation, Mahalanobis distance,5-7 and the hat matrix leverage,8 rely upon testing whether a statistic exceeds some critical value derived from a specified distribution, such as the normal or χ2 distributions. These methods give inaccurate results when there are multiple outliers in the data.9,10 Multiple outliers will distort measures of central location (the mean) and dispersion (the covariance matrix) to such an extent that those observations may not be recognized when analyzed by Mahalanobis distance or hat matrix leverage. This phenomenon is termed the masking effect. The distortion may also cause “good” observations to appear to be outliers; this effect is called the swamping effect. Thus, parametric methods will fail with multiple outliers because they inherently cannot take into account the masking and swamping effects. The important point is that traditional methods of outlier detection can fail without warning and need to be replaced by more robust methods. Robust analogues to the mean and standard deviation, such as the median and median absolute deviation (MAD), exist.11,12 However, these methods may perform poorly when the number of observations is small. This weakness is inherent to the method of calculation, because the median is defined as the observation in the middle of the range (or the average of the two observations closest to the middle for an even number of observations). For a small number of observations, the median may not truly

* To whom correspondence should be addressed: (phone) (803)-777-2461; (fax) (803)-777-9521; (e-mail) [email protected]. (1) Miller, J. N. Analyst 1993, 118, 455-461. (2) Dixon, W. J. Ann. Math. Stat. 1950, 21, 488-506. (3) Rorabacher, D. B. Anal. Chem. 1991, 63, 139-146.

(4) Barnett, V.; Lewis, T. Outliers in Statistical Data, 3rd ed.; John Wiley & Sons: New York, 1994; p 7. (5) Mark, H.; Tunnell, D. Anal. Chem. 1985, 57, 1449-1456. (6) Shah, N. K.; Gemperline, P. J. Anal. Chem. 1990, 62, 465-470. (7) Gemperline, P. J.; Boyer, N. R. Anal. Chem. 1995, 67, 160-166. (8) Hoaglin, D. C.; Welsch, R. E. Am. Stat. 1978, 32, 17-22. (9) Rousseeuw, P. J.; Leroy, A. M. Robust Regression and Outlier Detection; John Wiley & Sons: New York, 1987. (10) Rousseeuw, P. J.; van Zomeren, B. C. J. Am. Stat. Assoc. 1990, 85, 633-639. (11) Rousseeuw, P. J. J. Chemom. 1991, 5, 1-20. (12) Iglewicz, B.; Hoaglin, D. C. How to Detect and Handle Outliers; Quality Press: Milwaukee, WI, 1993.



represent the center of the distribution of the data; for a large number of observations, the median will be unaffected by outliers. Also, the median and MAD are univariate measures only. A common use for the median and MAD is to replace the mean and standard deviation when data are mean centered or autoscaled. Rousseeuw13 extended the median methodology to a robust regression technique, least median of squares (LMS). Shortly after its development, LMS was introduced in the chemical literature.14 LMS can find all unusual observations, but may erroneously identify normal observations as outliers.15 LMS also requires a lengthy optimization and is unstable in certain situations.16,17 Atkinson18 recommended using LMS with subsets of the original data to locate possible outliers and then using standard diagnostic techniques (e.g., standardized residuals and a modified Cook’s statistic) on the trimmed data set. Numerous methods for outlier detection have been based on the eigendecomposition, or principal component analysis (PCA), of the covariance matrix.19 Jolliffe20 discussed the use of PCA for outlier detection. Singular value decomposition (SVD) is the most numerically stable algorithm for PCA.21 Soft independent modeling by class analogy (SIMCA) was applied to outlier detection by Mertens et al.22 Gnanadesikan and Kettenring23 used the first and last few PCs to detect outliers. The first few PCs are distorted by observations that inflate variances, and the last few PCs are affected by observations that hinder recognition of an ill-conditioned data matrix. Hocking24 has suggested that outliers can be identified by performing PCA on a combined matrix containing both the dependent variable (y) and the independent variables (x’s). For principal component regression (PCR), Naes25 partitioned the hat matrix into two parts to aid in detecting outliers: one calculated from PCs used in the regression and the other calculated from the remaining PCs. However, no standard PCA method, regardless of which individual PCs are used or how they are examined, can fully correct for the effects of outliers, because the PCs themselves are distorted by the biased estimate of the covariance matrix. Robust multivariate methods for dealing with the problems caused by multiple outliers have been investigated. Often, the criterion for determining the effectiveness of a method is the breakdown point,11,26 i.e., the fraction of outliers required to make the statistic of interest unreliable. For example, the breakdown point of the mean is 1/n for n samples, which tends to zero as n increases. The mean, by this measure, is an ineffective statistic
(13) Rousseeuw, P. J. J. Am. Stat. Assoc. 1984, 79, 871-880. (14) Massart, D. L.; Kaufman, L.; Rousseeuw, P. J.; Leroy, A. Anal. Chim. Acta 1986, 187, 171-179. (15) Fung, W. J. Am. Stat. Assoc. 1993, 88, 515-519. (16) Yuzhu, H.; Smeyers-Verbeke, J.; Massart, D. L. Chemom. Intell. Lab. Syst. 1990, 9, 31-44. (17) Hettmansperger, T. P.; Sheather, S. J. Am. Stat. 1992, 46, 79-83. (18) Atkinson, A. C. Biometrika 1986, 73, 533-541. (19) Mardia, K. V.; Kent, J. T.; Bibby, J. M. Multivariate Analysis; Academic Press Ltd.: London, 1979. (20) Jolliffe, I. T. Principal Component Analysis; Springer-Verlag: New York, 1986. (21) Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetterling, W. T. Numerical Recipes: The Art of Scientific Computing; Cambridge University Press: Cambridge, U.K., 1986; pp 52-63. (22) Mertens, B.; Thompson, M.; Fearn, T. Analyst 1994, 119, 2777-2784.
(23) Gnanadesikan, R.; Kettenring, J. R. Biometrics 1972, 28, 81-124. (24) Hocking, R. R. Technometrics 1984, 26, 321-323. (25) Naes, T. Chemom. Intell. Lab. Syst. 1989, 5, 155-168. (26) Hampel, F. R. Ann. Math. Stat. 1971, 42, 1887-1896.

for detecting outliers. Campbell27 used M-estimators,28,29 which minimize influence functions other than the standard sum of squares of residuals and iteratively reweight the samples, calculating a robust estimate of the mean and covariance matrix. Singh30-32 used an M-estimator with a modified influence function (called PROP) for robust PCA in a series of chemometrics articles. Devlin et al.33 did not favor M-estimators, which have a breakdown point less than 1/(p + 1) for p-dimensional data, and recommended ellipsoidal multivariate trimming (MVT),23 which they claimed has a breakdown point equal to its trimming percentage. MVT has also been combined with principal component regression.34 However, Donoho and Gasko35 argued that MVT, convex hull peeling, ellipsoidal peeling, and sequential deletion all have breakdown points no greater than 1/(p + 1). Such low breakdown points would seriously compromise these methods. Newer methods, utilizing genetic36 or evolutionary37,38 algorithms for outlier detection, and projection pursuit for robust PCA/SVD,39,40 typically involve lengthy and complicated computations. Hove et al.41 introduced a somewhat less complex technique termed trimmed object projections to perform robust PCA. Excellent results are obtained when the minimum volume ellipsoid (MVE) or minimum covariance determinant (MCD) is used.10,42 However, as Woodruff and Rocke43 pointed out, “...current algorithms require unreasonable computational effort in high dimension...,” especially the MVE.44 These methods work best when used solely for the identification of possible outliers.15,45 Cho and Gemperline46 recently demonstrated the superior performance of MVE over Mahalanobis distance for detecting outliers in near-IR data. A tutorial review covering robust multivariate techniques used in chemical applications has also appeared.47 While many of the proposed methods are theoretically sound, their complexity, in terms of both comprehension and implementation, hampers their spread and use. We offer a different view of the multivariate outlier detection problem which leads to two methods
(27) Campbell, N. A. Appl. Stat. 1980, 29, 231-237. (28) Maronna, R. A. Ann. Stat. 1976, 4, 51-67. (29) Hoaglin, D. C., Mosteller, F., Tukey, J. W., Eds. Understanding Robust and Exploratory Data Analysis; John Wiley & Sons: New York, 1983. (30) Singh, A. In Multivariate Environmental Statistics; Patil, G. P., Rao, C. R., Eds.; North-Holland: Amsterdam, 1993; Chapter 22. (31) Singh, A.; Nocerino, J. M. In The Handbook of Environmental Chemistry; Einax, J., Ed.; Springer: New York, 1995; Vol. 2, part G, pp 229-277. (32) Singh, A. Chemom. Intell. Lab. Syst. 1996, 33, 75-100. (33) Devlin, S. J.; Gnanadesikan, R.; Kettenring, J. R. J. Am. Stat. Assoc. 1981, 76, 354-362. (34) Walczak, B.; Massart, D. L. Chemom. Intell. Lab. Syst. 1995, 27, 41-54. (35) Donoho, D. L.; Gasko, M. Ann. Stat. 1992, 20, 1803-1827. (36) Vankeerberghen, P.; Smeyers-Verbeke, J.; Leardi, R.; Karr, C. L.; Massart, D. L. Chemom. Intell. Lab. Syst. 1995, 28, 73-87. (37) Walczak, B. Chemom. Intell. Lab. Syst. 1995, 28, 259-272. (38) Walczak, B. Chemom. Intell. Lab. Syst. 1995, 29, 63-73. (39) Ammann, L. P. J. Am. Stat. Assoc. 1993, 88, 505-514. (40) Xie, Y.; Wang, J.; Liang, Y.; Sun, L.; Song, X.; Yu, R. J. Chemom. 1993, 7, 527-541. (41) Hove, H.; Liang, Y.; Kvalheim, O. M. Chemom. Intell. Lab. Syst. 1995, 27, 33-40. (42) Rousseeuw, P. J.
In Proceedings of the 4th Pannonian Symposium on Mathematics and Statistics; Grossman, W., Pflug, G., Vincze, I., Wertz, W., Eds.; D. Reidel Publishing Co.: Dordrecht, The Netherlands, 1983; pp 283-297. (43) Woodruff, D. L.; Rocke, D. M. J. Am. Stat. Assoc. 1994, 89, 888-896. (44) Rocke, D. M.; Woodruff, D. L. J. Am. Stat. Assoc. 1996, 91, 1047-1061. (45) Atkinson, A. C. J. Am. Stat. Assoc. 1994, 89, 1329-1339. (46) Cho, J.; Gemperline, P. J. J. Chemom. 1995, 9, 169-178. (47) Liang, L.; Kvalheim, O. M. Chemom. Intell. Lab. Syst. 1996, 32, 1-10.


that are easy to understand, simple to implement, and still handle the difficult problems that cause traditional methods to fail.

THEORY
Notation. The data matrix X has n observations in the rows and p variables in the columns. X is column mean centered, unless stated otherwise. Vectors are shown as bold lowercase, while scalar variables are shown in lowercase.
Classical Methods. Most methods for outlier detection use some form of the covariance matrix C. The covariance matrix is the sums of squares and products matrix, corrected for the number of degrees of freedom (n - 1):

C = X′X/(n - 1)    (1)

Mahalanobis distances for each observation from the centroid of the data matrix are directly derived from C. A column vector containing the square of the Mahalanobis distances, d2, is extracted from the diagonal elements of the matrix obtained by premultiplying the inverse of C by the mean-centered data matrix X and postmultiplying by the transpose of X:

d2 = diag(XC^-1X′)    (2)

The hat matrix is obtained by premultiplying the inverse of X′X by the mean-centered data matrix X and postmultiplying by the transpose of X, designated X′:

H = X(X′X)^-1X′    (3)

The diagonals of the hat matrix (hii) are termed the leverage values. Observations having hii greater than 2p/n are commonly considered outliers.
Singular Value Decomposition. SVD is well known to be the most numerically stable method for calculating the eigendecomposition of a matrix. SVD decomposes the data matrix into three matrices:

Xn×p = Un×k Sk×k V′k×p    (4)

These matrices have the following properties: because U and V are column orthonormal, U′U = V′V = I, and S is a diagonal matrix containing the singular values, which are the square roots of the eigenvalues of XX′ and X′X. For data arranged in our notation, the normalized principal component scores are the columns of U, while the loadings are the columns of V. The classical statistics shown, written in terms of the SVD, are

C = VS^2V′/(n - 1)    (5)

d2 = diag((n - 1)UU′)    (6)

hii = diag(H) = diag(UU′)    (7)
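To make eqs 1-3 and 5-7 concrete, the sketch below computes the covariance matrix, the squared Mahalanobis distances, and the leverage values from the SVD of the column mean-centered data matrix. It is an illustrative Python/NumPy rendering (the authors implemented their algorithms in MATLAB); the function and variable names are ours, and a full-column-rank data matrix is assumed.

```python
import numpy as np

def classical_stats(X):
    """Covariance (eq 5), squared Mahalanobis distances (eq 6), and hat matrix
    leverages (eq 7) from the SVD of the mean-centered data matrix.
    Assumes X has full column rank (illustrative sketch, not the authors' code)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                          # column mean centering
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    C = Vt.T @ np.diag(s**2) @ Vt / (n - 1)          # eq 5: C = V S^2 V'/(n - 1)
    h = np.sum(U**2, axis=1)                         # eq 7: leverages = diag(UU')
    d2 = (n - 1) * h                                 # eq 6: d2 differs only by the n - 1 factor
    return C, d2, h

# usage: flag observations whose leverage exceeds the common 2p/n cutoff
X = np.random.default_rng(0).normal(size=(20, 5))
C, d2, h = classical_stats(X)
suspect = np.where(h > 2 * X.shape[1] / X.shape[0])[0]
```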


Therefore, the squared Mahalanobis distances differ from the leverage values by only the degrees of freedom correction. Rousseeuw9 found the same result with a different derivation. MVT Method. For multivariate trimming,23 squared Mahalanobis distance values for each observation in the entire data set are calculated. A fixed percentage of the observations with the highest d2, chosen beforehand by the analyst, are removed and the process is repeated until either successive estimates of C stabilize or a set total percentage of the observations is trimmed. The remaining observations are used to calculate a mean and covariance matrix. MVT implicitly assumes the Mahalanobis distances are correct in some general fashion, even if they do not indicate that an observation is an outlier when compared, for instance, to the χ2 distribution. As the literature10 shows, the Mahalanobis distance will fail when used on data containing multiple outliers. MCD Method. The MCD method42 randomly selects subsets of the data (of size n/2) and computes the determinant of each matrix of subset data. The determinant of a matrix is the product of the eigenvalues of that matrix and measures the p-dimensional volume of the data (or, in this case, the subset of data selected). This follows from the fact that the eigenvalues represent the variances, or scales, of each eigenvector; the eigenvalues are the lengths of each eigenvector. The subset with the smallest nonzero determinant is then used to generate a covariance matrix from which multivariate distances (∼d2) are calculated to detect outliers. Proposed Methods. The covariance matrix is the basis of most statistics: Mahalanobis distances, hat matrix leverage values, principal component scores, robust PCA, MVE, MCD, etc. The robust methods generate robust analogues to C, while the nonrobust methods use the original covariance matrix. All these methods require that a measure of central location be known or estimated, since they all subtract the mean (or centroid) from the data. Outlying observations, by definition, distort this measure of location. Given p dimensions, outliers may have extreme values on some, all, or even none of those dimensions. The covariance matrix is commonly used for outlier detection, since it provides a measure (sometimes unreliable) of how the observations are oriented in those p dimensions. Our approaches work with the concept of a center (centroid) for the data, rather than using the covariance matrix. The orientation of each observation in p dimensions may be thought of as a vector projecting out from the centroid in a particular direction. The length and orientation of each vector are determined by the position of the centroid of the data, and these two characteristics are greatly affected by outliers. Figure 1 demonstrates this effect with two plots, each one using the same two features for the x and y axes, based on the wood data, described below. Figure 1A shows the data vectors without outliers; Figure 1B shows the effect of outliers on the vectors. The data used to generate each subplot was autoscaled; i.e., each column in the data matrix was mean centered and divided by the standard deviation of values in that column. Because the goal is to create a test statistic that is resilient to the effects of as many outliers as possible, use of the mean calculated from the entire data set for this purpose would be

Figure 1. Example of the distortion of observation vector lengths and orientations caused by the presence of outliers: (A) data set 1 without outliers; (B) data set 1 with four outliers added. Many of the vector lengths and orientations change noticeably. Both sets of data were autoscaled.

inappropriate. While a number of authors42,48 have advocated subset selection approaches, such methods have generally been restricted to subsets of size p or p + 1 (used in the regression case) or 1/2(n + p + 1) (for MVE). Woodruff and Rocke43 found that larger sample sizes improve results in their study of MCD and MVE. In the proposed outlier detection methods, we decided that a reasonable number of observations on which to base statistics is 50% of the full data set. Half the data must be “good”, otherwise, we are not searching for outliers. If half the observations are to be evaluated, the next question to be answered is, which half? We propose two methods that provide different ways to answer this question. Method 1. In general, resampling methods are used to generate a distribution of some statistic of interest by repeatedly calculating that statistic from randomly selected portions of the data. Each of these randomly selected portions of the data is termed a sample. There are two common sampling procedures. Sampling without replacement means that once an observation is selected at random, it may not be selected again for that sample. Sampling with replacement, on the other hand, means that after an observation is selected, it may be selected again for inclusion in that particular sample. The well-known technique bootstrapping uses sampling with replacement.49 We detect outliers by studying the distribution of observation vector lengths obtained by sampling without replacement from the original data set. We refer to this method as resampling by half-means (RHM). Figure 2 presents a flow chart for the complete algorithm. Observations are selected (without replacement) from the original data matrix X until a sample size of 50% (i.e., n/2) of the original observations has been attained. These sampled observations are placed in a new matrix, X(i), designated the ith sample matrix. The mean m(i) and standard deviation s(i) of X(i) are calculated. The original data matrix X is autoscaled using the sample mean and standard deviation. The autoscaled X matrix is then employed to calculate a column vector l(i) (48) Hawkins, D. M.; Bradu, D.; Kass, G. V. Technometrics 1984, 26, 197-208. (49) Efron, B.; Tibshirani, R. J. An Introduction to the Bootstrap; Chapman & Hall: New York, 1993.

Figure 2. Flow chart showing the steps used in the RHM method of outlier detection.

containing the vector lengths for all observations in the ith sample matrix, over all p variables:

l(i) = sqrt( Σ(k=1..p) [ (Xk - mk(i)) / sk(i) ]^2 )    (8)
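The RHM procedure of Figure 2 and eq 8 is short enough to state in code. The sketch below is an illustrative Python/NumPy rendering of the steps described in the text (half-samples drawn without replacement, the full data matrix autoscaled with each half-sample's mean and standard deviation, and appearances in the upper 5% of vector lengths tabulated); the function name, arguments, and defaults are ours and not the authors' MATLAB implementation.

```python
import numpy as np

def rhm_counts(X, n_samples=500, tail=0.05, seed=None):
    """Resampling by half-means (RHM) sketch: count how often each observation's
    vector length (eq 8) falls in the upper tail of the empirical length
    distribution. Illustrative only."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    L = np.empty((n, n_samples))
    for i in range(n_samples):
        half = rng.choice(n, size=n // 2, replace=False)   # sample without replacement
        m = X[half].mean(axis=0)
        s = X[half].std(axis=0, ddof=1)
        Z = (X - m) / s                                    # autoscale ALL observations with m(i), s(i)
        L[:, i] = np.sqrt((Z ** 2).sum(axis=1))            # eq 8 vector lengths
    cutoff = np.quantile(L, 1.0 - tail)                    # empirical upper 5% boundary
    return (L > cutoff).sum(axis=1)                        # tabulated appearances per observation

counts = rhm_counts(np.random.default_rng(1).normal(size=(50, 5)))
```

Observations with counts much larger than the rest are the suggested outliers (compare the RHM counts column of Table 2).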

The autoscaling performed during RHM serves the purpose of altering the spread of data on each variable to place them in a common frame of reference. Just as the covariance-based methods create a decision boundary using a hyperellipse, autoscaling permits RHM to create a decision boundary using a hypersphere. (Note: autoscaling is most appropriate when the variables describing the observations use different scales; spectral variables (wavelengths) are on the same scale and should not be divided by their standard deviation unless there is a definite reason to do so.) After i samples have been generated, an n by i matrix of vector lengths L is produced. Outlying observations can be detected by examining the empirical distribution of the vector lengths. For a sample that does not contain any outliers, the vector lengths of the outliers (with respect to the original data matrix) will be large and appear in the highest region of the distribution. Remember that X in eq 8 is the data matrix for the entire data set, and vector lengths are calculated for all observations. The squared vector lengths are sums of squares and should follow a χ2 distribution.

Figure 3. (A) Histogram of vector lengths calculated by the RHM method for a data set randomly generated using a multivariate normal distribution. (B) Histogram of vector lengths calculated by the RHM method for a data set generated in similar fashion, but with 10% outliers.

However, in our experience, the squared vector lengths do not follow that distribution, most likely due to the sampling scheme employed. Thus, we have used the empirically derived distribution, as is commonly done in bootstrapping. The deviation of the RHM vector length distribution from the χ2 distribution is quite apparent. Figure 3A shows a histogram of the vector lengths from a 500-observation data set randomly generated using the multivariate normal distribution. Figure 3B shows the histogram for a similar data set contaminated 10% by observations from a multivariate normal distribution with a different mean and covariance matrix. Note the appearance of a second, smaller region, which corresponds to extreme vector lengths caused by the outlying observations. This provides an excellent visual diagnostic for determining whether observations selected as outliers really belong to the data set or belong to another distribution entirely. A visual diagnostic alone, however, is insufficient. The RHM method inherently provides a simple way to identify outlying observations. When the vector lengths for all the samples stored in L are sorted, the upper 5% are checked to see which observation corresponds to which extreme vector length. The number of times each observation appears in the upper 5% of the distribution is tabulated, revealing the outliers.
Method 2. The second proposed method makes use of the distances between observations in the multivariate space, expressed as vector lengths. The vector length between the two observations i and j is

lij = sqrt( Σ (xi - xj)^2 )    (9)

where the summation is over all variables. Note that the data may, but need not, be scaled before calculating the distances between observations. In cluster analysis, a matrix called the distance matrix is often calculated. The distance matrix is simply an n × n matrix with each entry containing the distance between two observations. A number of distance measures exist; we have used the vector length.

Obviously, the diagonals of the distance matrix are equal to zero, as each observation has no distance from itself. Once the vector lengths between observations are calculated and stored in the distance matrix, each column is sorted from lowest to highest distance. For each column, the n/2 smallest distances are summed. The column with the smallest sum has an interesting property: since each column represents one of the observations, this particular column represents the observation with the smallest cloud of n/2 - 1 observations around it. Since determining which group of n/2 observations is most internally similar is the goal of MCD, MVE, and RHM, the use of the distance matrix provides a quick way to make this determination. Once the n/2 most similar observations have been selected, their mean vector and covariance matrix are easily calculated. (If any scaling has been performed, it should be derived from the n/2 most similar observations and then applied to the entire data set.) Then, the Mahalanobis distances for all n observations are calculated using those measures of centrality and dispersion, which may then be used to identify outliers. This is essentially what MCD does, but the need for many eigendecompositions to calculate the determinants of numerous samples of size n/2 is avoided. We refer to this method as the smallest half-volume (SHV) method.
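A minimal sketch of the SHV selection just described, again in illustrative Python/NumPy rather than the authors' MATLAB: build the eq 9 distance matrix, find the column whose n/2 smallest entries have the smallest sum, and compute Mahalanobis distances for all observations from the mean and covariance of that half of the data. The helper name and the dense n × n distance matrix are our assumptions.

```python
import numpy as np

def shv_distances(X):
    """Smallest half-volume (SHV) sketch: Mahalanobis distances for all n
    observations, based on the n/2 mutually closest observations."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    h = n // 2
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))          # n x n matrix of eq 9 distances
    sums = np.sort(D, axis=0)[:h, :].sum(axis=0)  # sum of the n/2 smallest distances per column
    center = np.argmin(sums)                      # observation with the tightest half-cloud
    half = np.argsort(D[:, center])[:h]           # that observation plus its n/2 - 1 nearest neighbors
    m = X[half].mean(axis=0)
    C = np.cov(X[half], rowvar=False)             # covariance of the selected half (assumes h > p)
    d2 = np.einsum('ij,jk,ik->i', X - m, np.linalg.inv(C), X - m)  # squared Mahalanobis, all n
    return d2, half

d2, core = shv_distances(np.random.default_rng(2).normal(size=(40, 4)))
```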

TEST DATA
To test algorithms for detecting multivariate outliers, we have chosen two data sets often used for comparison in the statistical9,10,15,45 and chemical literature,37 plus an additional data set from one of our research projects. All data sets exhibit extreme masking effects caused by multiple outliers.
Wood Data Set. Rousseeuw9 modified the well-known wood data used in Draper and Smith’s classic text.50 The modified wood data includes 20 observations, 5 independent variables measuring fiber percentages and specific types of wood content, and 1 dependent variable, the specific gravity of each piece of wood. Rousseeuw altered the data to make four of the observations (4, 6, 8, and 19) outliers. The data set (independent variables only) was autoscaled and then analyzed using hat matrix leverage values, the first PC, MVT, MCD, and the proposed RHM and SHV methods. MVT used 25% trimming with iterations terminated after the covariance matrix stabilized. MCD42 sampled the data 5000 times and determined approximate squared Mahalanobis distances for all observations based on the subset of data having the smallest determinant. RHM used 500 samples and determined relative frequencies of those observations having vector lengths in the upper 5% of vector lengths. All algorithms were written in MATLAB for Windows, v4.2c.1 (The MathWorks, Inc., Natick, MA).
HBK Data Set. Described and analyzed fully in the Supporting Information.
Reflectance Absorbance Infrared (RA-IR) Data Set. Described and analyzed fully in the Supporting Information.

(50) Draper, N.; Smith, H. Applied Regression Analysis; John Wiley & Sons: New York, 1966; p 227.

Table 1. Levels of Six Factors Used in the Mixture Design for the Simulation Test

level    N      P     F      D        S      NS^a
1        20     5     0.1    1.1      0.1    1
2        65     16    0.2    1.575    0.2    2.25
3        110    28    0.3    2.05     0.3    3.5
4        155    39    0.4    2.525    0.4    4.75
5        200    50    0.5    3        0.5    6

^a N, no. of observations; P, no. of variables; F, fraction of outliers; D, distance of outliers from edge; S, spherical spread of outliers; NS, no. of samples = (number in column) × N.

Simulation Study. The performances of the RHM and SHV methods were compared to those of the MCD and MVT methods on data sets randomly generated from a multivariate normal distribution contaminated with outliers. Characteristics of the random data were varied according to a mixture experimental design with five levels for each of six factors: number of observations (N), number of variables (P), fraction of outliers (F), distance of the outlier centroid from the farthest edge of the good data (D), spherical spread of the outliers (S), and number of samples used for RHM (NS).51 Outliers were generated in accordance with the “shift” model, where the outliers have a covariance matrix similar to the rest of the data but have a shifted mean.44 The percentages of outliers varied from 10 to 50%, in 10% increments. The outlier percentage was allowed to go as high as 50% to be consistent with the ideal breakdown point concept from the statistical literature. The number of variables was allowed to exceed the number of observations to reflect real spectral data. Each simulation included 500 runs; the design points for each run were selected randomly with equal probability from the factor levels shown in Table 1. The RHM test for the distribution of vector lengths was kept at 95% for all simulations. The percentage of actual outliers identified was used as the fitness criterion.
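For readers who wish to generate comparable test data, the sketch below constructs one contaminated data set under the “shift” model (clean multivariate normal observations plus an outlier group with a shifted mean). The shift direction, the reading of D as a multiple of the farthest “good” projection, and all names are our assumptions; the authors' exact generator is not specified beyond Table 1 and the description above.

```python
import numpy as np

def shift_model_data(n=110, p=16, frac_out=0.2, dist=2.05, spread=0.3, seed=None):
    """One simulated data set: 'good' N(0, I) observations plus a fraction of
    shift-model outliers. Illustrative only; not the authors' generator."""
    rng = np.random.default_rng(seed)
    n_out = int(round(frac_out * n))
    good = rng.normal(size=(n - n_out, p))
    direction = np.ones(p) / np.sqrt(p)                  # assumed shift direction
    edge = good @ direction                              # projections of the good data
    center = dist * edge.max() * direction               # outlier centroid beyond the good edge
    outliers = spread * rng.normal(size=(n_out, p)) + center
    X = np.vstack([good, outliers])
    labels = np.r_[np.zeros(n - n_out, dtype=int), np.ones(n_out, dtype=int)]
    return X, labels

X, y = shift_model_data(seed=3)
```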

Table 2. Analysis Results for the Wood Data^a

observn   hat matrix leverage, hii   MVT d2    1st PC score, U1   MCD ∼d2     RHM counts   SHV ∼d2
1         0.2277                      7.7946    0.1491             10.5872      0           4.6750
2         0.0817                      2.9767    0.1408              2.6008      0           4.2346
3         0.1697                      6.0608    0.0328              8.2235      0           4.6055
4^b       0.2084                      4.183    -0.3985            155.5182     71         147.1103
5         0.1725                      5.1642    0.0651              7.1119      0           2.6068
6^b       0.2091                      4.2102   -0.4207            171.9537    104         169.3348
7         0.4802                     44.8921    0.0822             11.1910      3          45.3022
8^b       0.2387                      5.8316   -0.4393            176.0948    121         192.2662
9         0.2981                     27.3597    0.1286              6.0226      0          24.4689
10        0.3993                     18.5022    0.1992             24.7803     40          19.1563
11        0.2671                      8.7995    0.0358             13.0462      0          38.8459
12        0.3596                     16.7254    0.2112              9.6668     12          19.6781
13        0.2372                      9.6303    0.1969             12.0180      9          17.1921
14        0.0789                      1.4081    0.1495              6.5459      0           2.4662
15        0.1024                      5.793     0.1252              5.8131      0           1.7041
16        0.4762                     30.1707    0.0602             14.5320      0           6.6631
17        0.2394                     12.4253    0.1113             13.1902      0           6.1871
18        0.2440                      8.7188    0.0222             12.9793      0           5.3873
19^b      0.2421                      4.6587   -0.4706            200.1913    140         212.0690
20        0.2676                      7.3453    0.019              16.5148      0           6.4704

^a Italic type indicates suggested outliers. ^b Known outliers.

(51) Cornell, J. A. Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data; John Wiley & Sons: New York, 1990.

RESULTS AND DISCUSSION
Wood Data Set. Observations 4, 6, 8, and 19 are the outliers. Results for each of the multivariate outlier detection techniques applied to data set 1 are shown in Table 2. Not one hat matrix leverage value (hii) exceeded the recommended cutoff for outlier detection of 2p/n (0.50 in this case). Thus, no outliers were found with this method. This result is consistent with that of Rousseeuw,9 who did not autoscale the data. (The change in preprocessing might have altered the results; it did not.) Observations 7 and 16, which are not outlying, came very close to the cutoff. The failure of the leverage values was expected, because of the distortion of the covariance matrix by the multiple outliers.
MVT identified no true outliers, instead selecting six of the valid observations as outliers because they exceeded the critical value of 11.07 for the χ2 distribution (5 df). MVT, which trims observations with high d2 values, was also expected to fail. Because the Mahalanobis distances based on the mean and covariance matrix will be incorrect in the presence of multiple outliers, “good” observations and not outliers are removed.
The first normalized PC score (U1), MCD, RHM, and SHV methods all correctly indicated observations 4, 6, 8, and 19 as outlying. We expected extreme outliers to be identified by their PC scores, since outliers greatly distort the covariance matrix. Observation 10, part of the unaltered data, has a nonnegligible RHM count and fairly large ∼d2 values for MCD and SHV. These values suggest that observation 10 is probably in the ill-defined area somewhere between being representative of the majority of the data and being a true outlier.
HBK and RA-IR Data Sets. For the HBK data, results similar to those of the wood data were obtained, except that the MVT method identified all outliers, as did a plot of the first two normalized PC scores. For both the HBK and RA-IR data sets, the RHM and SHV methods clearly identified all outliers. These tests confirm the usefulness of the RHM and SHV methods. The squared Mahalanobis distances (d2) and hat matrix leverage values (hii) are untrustworthy due to their susceptibility to the masking effect. Because MVT relies upon d2 to order and trim observations, it is also unreliable. Unfortunately, there is no way to determine when MVT might work, as exemplified by the method’s different performance on the wood data set vs the HBK data set. While MCD performed well, that method requires numerous eigendecompositions, and the number of eigendecompositions needed to find a subset of the data with an acceptably small determinant is never known.
Simulation Study. The simulation was designed to examine how the data and outlier characteristics interact to affect RHM and SHV results and to compare these proposed methods with MVT and MCD.


Table 3. Simulation Results for Outlier Detection Methods^a

method                               % perfect ID   % missed some   % total failure   FLOPS
MVT (15% trimming, 4 iterations)     3.0            37.4            59.6              1.5554 × 10^10
MVT (25% trimming, 4 iterations)     3.6            30.8            65.6              1.2720 × 10^10
MCD                                  18.0           55.2            26.8              6.6757 × 10^10
RHM (95% level of confidence)        37.4           39.2            23.4              1.8031 × 10^9
SHV                                  58.6           12.6            28.8              1.3917 × 10^9

^a Percentages given are for when the method perfectly identified all outliers (perfect ID), missed some outliers, or failed to identify any outliers. The number of floating point operations (FLOPS) required to finish the simulation is also listed.

The number of observations and variables were varied over a range sufficient to create under- and overdetermined systems. A major advantage of the proposed methods is that the iteration step does not require matrix inversion, which is required in many other techniques. Although computationally expensive, SVD is recommended for all matrix inversions and is required for numerical stability whenever the number of variables exceeds the number of observations (as with most spectroscopic data).21 When the number of variables exceeded the number of observations, there were no deleterious effects on RHM or SHV performance. This result was expected, since the problem caused by underdetermined systems appears at the matrix inversion step, which RHM avoids entirely, and which SHV performs only once after a good estimate of the central location has been obtained. RHM has only one adjustable parameter, the number of samples. In the simulation, the number of samples (NS) was varied from 1 to 6 times the number of observations in each random data set. When the number of samples is at least twice the number of observations, RHM performs well.
Simulation results (Table 3) include the percentage of times all outliers were perfectly identified, the percentage of times only some of the outliers were identified, the percentage of times no outliers were identified, and the number of floating point operations (FLOPS) required. No attempt was made to optimize the algorithms; optimization would greatly decrease the FLOPS required for the SHV algorithm, which already consumes the least computer resources of the methods tested. A common point made in the literature is that both MCD and MVE are likely to identify “good” observations falsely as outliers. In the simulations, these type I errors occurred with similar frequencies for all methods tested, so the exact numbers are not reported.
For MVT, the trimming percentage (15 or 25%) had little effect on the fraction of outliers identified (which was extremely poor, ∼3%). In both MVT simulations, the design factors were not correlated with the fraction of outliers identified (highest r2 was 0.0327, p = 0.0115). Such poor performance is an inherent part of the method due to its initial use of the Mahalanobis distance. For the MCD method, the fraction of outliers detected was negatively correlated (r2 = 0.1636, p < 0.0001) with the number of observations and with the fraction of outliers (r2 = 0.1814, p < 0.0001) but slightly positively correlated (r2 = 0.0230, p < 0.05) with distance of outliers from “good” observations. For the RHM method, the fraction of outliers detected was highly negatively correlated (r2 = 0.6965, p < 0.0001) with the fraction of outliers and slightly positively correlated (r2 = 0.0632, p < 0.0001) with distance of outliers from “good” observations. SHV exhibited the highest score on perfect identification of outliers (58.6%).

Figure 4. Simulation results for testing method breakdown points: (A) MCD; (B) RHM; (C) SHV.

Similar to RHM, the method’s performance was highly negatively correlated (r2 = 0.5618, p < 0.0001) with the fraction of outliers and slightly positively correlated (r2 = 0.0481, p < 0.0001) with distance of outliers from “good” observations. Note that the highly significant p values cited here are due more to the large degrees of freedom than to the actual importance of the relationship. Statistical significance does not necessarily equal practical usefulness.
Breakdown Evaluation. The most significant factors from the above simulation were varied to evaluate the breakdown points for the MCD, RHM, and SHV methods; MVT was not tested further because of its poor performance. Figure 4A shows the breakdown of the MCD method as the fraction of outliers and numbers of observations are increased. MCD has a breakdown point between 20 and 30% outliers, depending on these two factors. A similar breakdown point was found in simulations run by Rocke and Woodruff.44 The breakdown point of the RHM method (Figure 4B) varies between 20 and 45%, depending on the fraction of outliers and their multivariate distance from the rest of the data. For the SHV method (Figure 4C), the breakdown point varies between 25 and 45% outliers and depends on the same factors as the RHM method. The SHV method seems to break down much more suddenly, as opposed to the gradual declines in performance shown by MCD and RHM, but can tolerate higher fractions of outliers.

CONCLUSIONS
We have demonstrated that the failure of the Mahalanobis distance and hat matrix leverage values to identify outliers can carry over to the technique of multivariate trimming. Given the theoretical relations between these techniques and their demonstrated failures to handle multiple outliers, we recommend that their use be avoided. While the masking effect may be dealt with by using a computationally complex method such as MCD or MVE, our results indicate that the presence of the masking effect may be determined through the visual inspection of PCA plots. However, correct confidence regions are not available for PCA unless more complicated techniques that generate a robust covariance matrix are used. While the ability of the first and second PCs to identify outliers is intriguing, the fact that the PCs become progressively more distorted seriously compromises any approach based upon selection of PCs to generate confidence regions.

Lacking a priori knowledge, one cannot know at which PC the distortion is so great as to make the Mahalanobis distances unreliable. Simulation demonstrates that the proposed RHM and SHV outlier detection methods possess breakdown points slightly better than the MCD method. The simulation results also indicate that, as is commonly thought, the most important factor affecting outlier detection methods is the fraction of outliers. Outlier size had little effect. Data sets with more variables than observations cause no problems for the RHM and SHV methods. The resampling upon which the RHM method is based does require moderate computational resources, while the SHV method requires only the calculation of a distance matrix. Both the RHM and SHV methods are theoretically simple and perform better than MCD. Once the half of the observations possessing the smallest volume is found, a procedure quite similar to that of the MCD is followed, making the SHV method really just a much simpler, and better, version of MCD. Either proposed method is recommended for the detection of multiple outliers in multivariate data.

ACKNOWLEDGMENT
The authors thank Marie C. Egan (Department of Psychology, The University of South Carolina, Columbia, SC) and Dr. Melinda Higgins (Georgia Tech Research Institute, Atlanta, GA) for their insightful comments. Support for this work was provided by both the U.S. Department of Energy via Cooperative Agreement DE-FC02-91ER75666 and by Award 97-LB-VX-0006 from the Office of Justice Programs, National Institute of Justice, Department of Justice. Points of view in this document are those of the authors and do not necessarily represent the official position of the U.S. Department of Justice. Portions of this work were presented as paper 424 at PittCon97 in Atlanta, GA.

SUPPORTING INFORMATION AVAILABLE
Description and analysis of HBK and RA-IR data sets (7 pages). Ordering information is given on any current masthead page.

Received for review July 15, 1997. Accepted February 13, 1998.

AC970763D
