Anal. Chem. 1985, 57, 1449-1456
Qualitative Near-Infrared Reflectance Analysis Using Mahalanobis Distances Howard L. Mark* Technicon Instrument Corp., Industrial Systems Division, 511 Benedict Avenue, Tarrytown, New York 10591
Downloaded by NANYANG TECHNOLOGICAL UNIV on August 24, 2015 | http://pubs.acs.org Publication Date: June 1, 1985 | doi: 10.1021/ac00284a061
David Tunnell Technicon Instruments Co., Ltd., Hamilton Close, Houndsmill, Basingstoke, Hants, RG21-lB2 United Kingdom
Near-infrared reflectance spectrometry is becoming more and more popular for quantitative analysis. Its potential for qualitative analysis has been neglected, however, due to the lack of appropriate methods of treating the data that correspond to the multiple regression analysis used for quantitative calibrations. This study describes the use of the multivariate technique called discriminant analysis; in particular, the advantages of the approach of using Mahalanobis distances are investigated.
The techniques covered by the term near-infrared reflectance spectrometry have developed over a number of years since the initial work of Norris and co-workers (1-3). The initial development was for the analysis of agricultural products (4, 5), but the applications have since broadened to include the analysis of many types of materials. A distinguishing characteristic of the technology is the calibration of the instrument through the use of multiple regression analysis (4, 6). This has led to the development of the broad range of uses to which this type of instrumentation has been put, since it allowed workable calibrations to be generated without the need for explicit corrections for often unknown and possibly incorrigible error sources. However, the nature of the calibration process has limited the use of the technique (and the accompanying technology) to quantitative analysis of materials known to be present in the sample. More recently, other mathematical approaches to the handling of near-infrared reflectance data have been tried (7, 8), but these attempts have also been directed toward obtaining quantitative results from the available technology.

Qualitative analysis via the use of near-infrared reflectance technology is virtually unknown. Rose (9) has distinguished 40 different pharmaceutical raw materials using the discriminant analysis capabilities of the SAS program package. Shenk et al. (10) have used the HAT matrix approach for the more limited case of determining whether forage samples for quantitative analysis came from the same population as the calibration samples. Identification of raw materials is a critical need in the pharmaceutical industry; pharmaceutical manufacturers have to account for their manufacturing materials from the point of entry into the plant and verify that each drum of raw material is what it is supposed to be (11).
Consequently there is a requirement for a rapid measurement technique that can distinguish many different types of solid powders from each other. Near-infrared reflectance is a rapid measurement technique that is currently in routine use for performing analytical measurements on just that type of sample. For quantitative analysis the mathematical technique of multiple regression analysis has been employed for extracting the necessary information from the resulting mass of data; for qualitative analysis the corresponding mathematical technique is discriminant analysis.
THEORY

A chemist examining a spectrum for the purpose of determining the nature of the sample giving rise to the spectrum normally concentrates his attention on the regions of the spectrum showing absorbance peaks and classifies or identifies the sample by matching the location and strength of absorbance peaks to those of known substances. It is possible to generalize this procedure by noting that, if the absorbance is measured with sufficient accuracy, then any wavelength where there are absorbance differences between the substances to be distinguished can serve to classify them.

For illustration, we consider the three spectra in Figure 1, which represent starting materials for various pharmaceutical preparations. Two spectra of each sample are shown. These represent the high and low extremes of five readings of each sample. The spread between the spectra of each material is characteristic of reflectance spectra of powders. When a sample is repacked, different grains of the powder are near the surface, and the differences in orientation, particle size, etc. of the surface grains give rise to differences in reflectance when different packs of the same material are measured. Thus, in the case of solid powders measured by reflectance, the generation of discriminating criteria is complicated by the need to distinguish the different materials in the face of this extraneous source of variation in the measured spectra.

In this case it is easy, almost trivial, to distinguish these materials by eye; in the literature there are also mathematical approaches to the classification problem that make use of the entire spectrum (12). It is advantageous to have techniques available that do not require the entire spectrum in order to classify (identify) a substance. Instruments using a finite number of fixed filters, becoming more and more common in the near-IR spectral region, cannot use an algorithm based on a complete spectrum.
Even when monochromator-based instruments are used, time can be saved by measuring only a small number of preselected wavelengths. In the case at hand, we note that each spectrum has an absorbance value at the two wavelengths we will use to perform the classification, 1680 nm and 2090 nm. By plotting the absorbance readings of the five repacks of each of the three samples at 2090 nm vs. the corresponding readings at 1680 nm, we obtain the results shown in Figure 2A. The three materials are represented by three groups of points. The groups are well-separated; this indicates that the spectra of the various materials are sufficiently different at these two wavelengths that these two wavelengths alone can be used to characterize the three materials. Clearly, more substances could be included in this set of materials to be distinguished as long as all were different at these two wavelengths; in such a case they would be represented as points in different parts of the plane containing the data. In general, of course, we could not expect to classify an arbitrarily large number of different materials using only two wavelengths. Rather, we would need many wavelengths,
0003-2700/85/0357-1449$01.50/00 1985 American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 57, NO. 7, JUNE 1985
Figure 1. Spectra of three pure chemicals showing the variation in the spectra due to repacking of the sample. [Horizontal axis: wavelength (nm), 1200 to 2400.]
corresponding to a large number of dimensions. For large numbers of dimensions we must abandon the visual approach and create a mathematical method for locating data in multidimensional space. A computer can handle any number of dimensions with equal ease; thus we can discriminate between different materials using as many wavelengths as necessary to distinguish them. We will generate this mathematical description for the two-dimensional case, thus relating it to the visual description.

Consider Figure 2A, and note that each group is located in a different part of the space of interest. We will classify samples by defining the locations of the various groups in this space, and assign a sample to a group if it is “near” that group. The problem thus breaks down into two parts: locating the groups, and defining a criterion for determining whether a sample is “near” a given group in the space.

We define the position of a group in multidimensional space as the point corresponding to the mean value at each measured wavelength. This is illustrated in Figure 2A, where the mean at each wavelength for each group is indicated; the point where the lines join at the center of the group is the “location” of the group in space.

The determination of distance is a bit more complex. Euclidean distances are not optimum, for the following reason: If we consider the three groups in Figure 2, we note that each group is somewhat elongated, with the direction of elongation lying along an approximately 45° line through the center of each group. If we consider points M and N in Figure 2B, which are at the same distance from the center of group B, it is clear that point M is likely to be a member of group B, while point N is not, because point M lies along the direction of the elongation, while point N does not. We define a distance measure D in such a way that the equivalent Euclidean distance is large in those directions in which the group is elongated. This concept was introduced by P. C. Mahalanobis (13), and the quantity D, defining the “unit distance vector” in multidimensional space, is called the “Mahalanobis distance”. The Mahalanobis distance can be described by an ellipse (or ellipsoid, in more than two dimensions) that circumscribes the data, as shown in Figure 2C. A much less obscure development of the equations is presented by Gnanadesikan (14). The distance D, from a point X to the center of a group $\bar{X}_i$, is described by the matrix equation

$$D^2 = (X - \bar{X}_i)' M (X - \bar{X}_i) \qquad (1)$$
Figure 2. Three pure chemicals of Figure 1 plotted in a two-dimensional wavelength space: (A) data plus groupmeans; (B) data plus two hypothetical samples at equal Euclidean distances from group B; (C) data plus ellipses defining the limits of each group. [Axes: absorbance at 2090 nm vs. absorbance at 1680 nm.]
Figure 3. Flow of computation in discriminant analysis. [Flowchart labels: data; linear discriminant functions; compute Mahalanobis distances for unknowns to classify.]
where D is the distance, X is the multidimensional vector describing the location of point X, $\bar{X}_i$ is the multidimensional vector describing the location of the groupmean of the ith group, $(X - \bar{X}_i)'$ is the transpose of the vector $(X - \bar{X}_i)$, and M is a matrix determining the distance measures of the multidimensional space involved. The relationship of this equation to the previous discussion is apparent: the various $\bar{X}_i$ represent the groupmeans of the different materials to be distinguished. The distance measures described above are defined by the matrix M.

It is convenient to consider the distance from the center of the group to the circumscribed ellipse as representing one standard deviation of the data. We can consider the boundary of a group to be three standard deviations away from the groupmean. Then we can assign a sample within three standard deviations of a group unambiguously to that group. Also, if two groups have their groupmeans within six standard deviations of each other, the boundaries of the two groups will overlap, and we are forewarned of the possibility of misclassification.

Gnanadesikan (14) describes three variant ways to construct the circumscribing ellipses, calling them M1, M2, and M3. M1 is a unit matrix, and the ellipsoid reduces to a sphere. The use of M1 reduces the calculation of distance D to the calculation of Euclidean distances. Matrix M2 describes the inverse of the covariance matrix, with elements

$$x_{jk} = \sum (x_j - \bar{x}_j)(x_k - \bar{x}_k) \qquad (2)$$

where j and k are indexes that take on the values from 1 to the number of wavelengths in the discriminant training set, and x represents the data at a particular wavelength. The summation is taken over all the samples belonging to a particular group. Use of M2 implies fitting a separate ellipsoid to the data for each material to be distinguished. Matrix M3 describes the inverse of the matrix formed by pooling the within-group covariance matrices of all groups. Use of M3 defines a common metric for all groups in the dataset, indeed for the entire multidimensional space.

Figure 3 illustrates the flow of computation in discriminant analysis. Most of the standard texts and available computer programs use linear discriminant functions to perform classification. We have chosen to use the calculation of Mahalanobis distance, and, in particular, to use the calculation of the matrix M3 of Gnanadesikan as the basis of our approach to the
discrimination problem for a number of reasons: not only is it conceptually simple but, in addition, it allows for a straightforward method for selecting the wavelengths to be used for the discrimination, allows detection of samples that were not included in the calibration dataset, produces warnings of possible misclassification of samples with similar spectra, allows detection of outliers during the calibration process, and allows for transfer of a discriminant calibration to another instrument. The use of the inverse pooled covariance matrix (M3), rather than using individual matrices for each different material as M2 requires, also means that fewer samples of each material in the training set need be measured, an important practical consideration if many materials are to be distinguished.

The main disadvantage of using Mahalanobis distance for actual discrimination of unknowns (compared to using linear discriminant functions) is the amount of computation required. As shown in eq 1, the calculation of a Mahalanobis distance requires two matrix multiplications, considerably more computation than a linear discriminant function requires. However, the continuing decline in the cost of ever-increasing amounts of computer power makes this consideration almost trivial.

The use of these concepts in practice is straightforward. Data at the wavelengths available in an instrument are collected for several samples representing the various materials to be distinguished; these constitute the training set. The data at each wavelength are averaged separately for each material, forming the groupmean matrix. The pooled covariance matrix is formed by creating the covariance matrix for each group as described in eq 2, adding the corresponding terms of each individual covariance matrix, and dividing by n - m (where n is the total number of samples, and m is the number of groups). This matrix can now be inverted by any of the common methods (15).
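The calibration procedure just described (groupmeans, pooled within-group covariance divided by n - m, then inversion) and the distance of eq 1 can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' implementation; the function names `calibrate` and `mahalanobis` and the array layout (one row per sample, one column per wavelength) are assumptions made for the example.

```python
import numpy as np

def calibrate(X, labels):
    """Return (group names, groupmean matrix, inverse pooled covariance matrix).

    X      : (n_samples, n_wavelengths) array of readings (assumed layout)
    labels : group name of each sample, same length as X
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = sorted(set(labels.tolist()))
    n, p = X.shape
    m = len(groups)
    means = np.empty((m, p))
    pooled = np.zeros((p, p))
    for i, g in enumerate(groups):
        Xg = X[labels == g]
        means[i] = Xg.mean(axis=0)
        dev = Xg - means[i]
        pooled += dev.T @ dev      # sums of cross products for this group (eq 2)
    pooled /= (n - m)              # pool over all groups
    return groups, means, np.linalg.inv(pooled)

def mahalanobis(x, mean, M):
    """Distance D of eq 1 from reading x to one groupmean, using matrix M."""
    d = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(d @ M @ d))
```

With the matrix M taken as the identity, `mahalanobis` reduces to the ordinary Euclidean distance, which corresponds to the M1 case mentioned in the text.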
The groupmean matrix and inverse pooled covariance matrix in conjunction constitute what can be thought of as the calibration “equation”, even though they do not represent a single actual equation. To classify an unknown sample, the Mahalanobis distance D from the unknown to each of the materials in the training set is calculated from eq 1. The sample is assigned to the group it is closest to. In the usual case, a given sample will be within three Mahalanobis distances of the group it belongs to, and far away from any other group. There are two noteworthy exceptions to this.

The first exception is the case where a given unknown is farther from every group than three times the Mahalanobis distance (i.e., three standard deviations). Assuming that the discrepancy is not due to statistical variability or a change in the measurement technique, this is an indication that the unknown sample is of a type not represented in the calibration teaching set.

The other exception is the case where a given sample is close to more than one of the calibration samples. This can only occur if the calibration samples themselves have very similar spectra, and thus are close together in multidimensional space. This can be detected during the calibration. As discussed above, groups closer than 6 Mahalanobis distances overlap, potentially causing such misclassification.

Wavelength Selection. In the absence of a priori knowledge of which wavelengths are suitable for performing the desired discriminations, a method of selecting the optimum set of wavelengths is needed. The method devised is to compute the distances $D_{ij}$ between all pairs of groups i and j, then form the sum of the inverse squared distances, i.e., $\sum (1/D_{ij})^2$. The groups that are closest together will contribute most heavily to this sum; thus selecting those wavelengths that cause this sum to be smallest results in the selection of the wavelengths that best separate the closest groups, i.e., best
distinguish the most similar spectra. The groups that are far apart (i.e., dissimilar) are no problem to distinguish among. This technique is similar to that used in the Statistical Program for Social Sciences, a commercial program package that selects variables based on the best separation of the closest single pair of groups; but the technique presented here will optimize among all groups that are comparably closely spaced.

Transfer of Calibrations. Having generated a discriminant calibration, it is often useful to be able to utilize that calibration on several different instruments. Since the differences between different instruments result from unavoidable manufacturing variations, the effect on the readings is to cause a small change in the value of the readings taken at any given wavelength. The size and shape of the groups in multidimensional space are due mainly to the differences in reflectivity of the sample as different repacks, etc. are measured. Thus, to a first approximation, we can separate the effects due to instrument and sample; instrumental differences affect the groupmean matrix, and within-group sample differences affect the inverse pooled covariance matrix. Thus, in order to use the discriminant equation on a different instrument, we need adjust only the groupmean matrix to the readings from the new instrument. This adjustment can be performed on fewer samples than were needed to perform the original calibration; only one sample of each material is needed, in favorable cases.

Normalization. The use of the M3 matrix, i.e., the inverse pooled covariance matrix, as the basis of calculating Mahalanobis distances in multidimensional space implies that all of the groups have essentially the same size, shape, and orientation, so that only one specifying matrix is needed to describe them all.
In NIRA technology, as we have seen, the nature of the reflectance phenomenon is such that the data at all wavelengths have a tendency to increase or decrease together. As demonstrated in Figure 2, this shows up as the tendency for the data points to lie along parallel lines. In three dimensions, the phenomenon would appear as a tendency for the data of the various groups to form parallel “needles” in space, and so forth. However, samples with large values of log (1/R) have larger variations in the data than samples with small values of log (1/R) (6). This suggests that, while different groups tend to have the same shape and orientation, they can differ in size. This could lead to false conclusions when analyzing unknowns, since pooling the data from large and small groups gives an incorrect estimate of the sizes of the groups. This could then lead to false rejection of samples that are actually part of a large group and false assignment of unknown samples to a small group when in fact the sample lies beyond the boundary of that group.

To avoid this problem, we can normalize the Mahalanobis distance according to the sizes of the various groups that are in the training set. Having calculated M3, we can compute the root mean square size of each group by first calculating the distance $D_i$ of each sample in the training set from the groupmean of its assigned group, and then, for each group, calculate

$$\text{rms groupsize} = \left( \frac{\sum_{i=1}^{n} D_i^2}{n - 1} \right)^{1/2} \qquad (3)$$
Dividing the Mahalanobis distance from a sample to each groupmean by the root mean square groupsize for the corresponding group will normalize the distances so that more accurate representations of the distance to the various groupmeans can be calculated, decreasing the possibility of error. Computing normalized distances in this manner is, in a sense, intermediate between the M2 and M3 matrices of Gnanadesikan. Whereas M2 assumes a different shape, size,
and orientation for each group, and M3 assumes the same shape, size, and orientation for all groups, the use of normalized distances assumes the same shape and orientation, but different sizes for each group in the calibration.
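The normalization and the nearest-group assignment described in this section can be sketched as follows. This is an illustrative Python/NumPy sketch under stated assumptions: the function names, the `cutoff` parameter, and the use of an n - 1 divisor (with n the number of training samples in the group) for the rms groupsize are choices made for this example, not details confirmed by the text. Returning `None` when the sample lies beyond the cutoff from every group mirrors the first exception discussed earlier (a sample of a type not in the training set).

```python
import numpy as np

def rms_group_sizes(X, labels, groups, means, M):
    """Root mean square groupsize of each group (n - 1 divisor assumed)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    sizes = np.empty(len(groups))
    for i, g in enumerate(groups):
        dev = X[labels == g] - means[i]
        d2 = np.einsum("ij,jk,ik->i", dev, M, dev)   # D_i^2 for each sample
        sizes[i] = np.sqrt(d2.sum() / (len(dev) - 1))
    return sizes

def classify(x, groups, means, M, sizes, cutoff=3.0):
    """Assign x to the nearest group by normalized distance.

    Returns (group name or None, array of normalized distances).
    `cutoff` is a hypothetical parameter: 3.0 is the theoretical limit.
    """
    dev = np.asarray(means, dtype=float) - np.asarray(x, dtype=float)
    d = np.sqrt(np.einsum("ij,jk,ik->i", dev, M, dev)) / sizes
    best = int(np.argmin(d))
    return (groups[best] if d[best] <= cutoff else None), d
```

Passing `sizes = np.ones(len(groups))` recovers the unnormalized M3 distances, so the same function can be used to compare the two approaches on a given training set.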
EXPERIMENTAL SECTION

The principles outlined in the previous section were incorporated into a set of computer programs written in FORTRAN 77 on a Hewlett-Packard Model HP-1000 minicomputer. These programs include the capability of calibrating for discrimination, classifying unknowns after the calibration is generated, and adjusting the groupmean array after the calibration is generated. The calibration program contains extensive wavelength search capabilities, similar to those contained in regression programs (16). These computer programs are available from Technicon Instrument Corp. for use in conjunction with the Infrared Data Analysis System.

Near-infrared spectral data were collected from a Technicon InfraAlyzer 500 grating spectrophotometer using a Hewlett-Packard HP-1000 minicomputer to collect the optical data. Data for an interference-filter-based instrument were collected on an InfraAlyzer 400 spectrometer. All computations were performed on the HP-1000.

The samples were proprietary, and therefore not identified to us. They were divided into two categories. Samples to be used as the training set were marked with letters A through S. Samples to be considered “unknowns” were marked with numbers or number-letter combinations (e.g., 59D); the test of the discrimination process was to match the numbered samples to the correct letter sample.

The spectrum of each sample in the training set was measured at 4-nm intervals between 1100 and 2500 nm on the grating spectrometer. The spectrum of each sample was measured five times, repacking the sample anew for each measurement. The set of “unknowns” was treated similarly, but each sample’s spectrum was measured only twice.

In order to test the ability to transfer the discriminant calibration from the grating-based instrument to the interference-filter-based instrument, all samples (training set and unknowns) were read twice on the filter-based unit.
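The wavelength search mentioned above can be illustrated with the selection criterion from the Theory section: the sum of inverse squared distances between all pairs of groupmeans, minimized over candidate wavelength sets. The sketch below is in Python with NumPy and is only an illustration; the exhaustive search over all k-wavelength subsets, the function names, and the use of column indices to denote wavelengths are assumptions, since the text does not describe how the original programs organize the search.

```python
from itertools import combinations
import numpy as np

def separation_criterion(means, pooled_cov, wavelengths):
    """Sum of 1/D_ij^2 over all group pairs, using only the given columns."""
    idx = list(wavelengths)
    means = np.asarray(means, dtype=float)
    cov = np.asarray(pooled_cov, dtype=float)
    # Invert the covariance submatrix for the chosen wavelengths.
    M = np.linalg.inv(cov[np.ix_(idx, idx)])
    sub = means[:, idx]
    total = 0.0
    for i, j in combinations(range(len(sub)), 2):
        d = sub[i] - sub[j]
        d2 = float(d @ M @ d)
        # Coincident groupmeans cannot be separated at these wavelengths.
        total += 1.0 / d2 if d2 > 0 else float("inf")
    return total

def best_wavelengths(means, pooled_cov, k):
    """Exhaustively search all k-column subsets for the smallest criterion."""
    p = np.asarray(means).shape[1]
    return min(combinations(range(p), k),
               key=lambda w: separation_criterion(means, pooled_cov, w))
```

Note that the pooled covariance submatrix is inverted for each candidate set; taking a submatrix of the already inverted full matrix would not give the same distances. An exhaustive search grows quickly with the number of available wavelengths, so a practical program would probably restrict or stage the search.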
RESULTS AND DISCUSSION

The spectra corresponding to the “known” sample types A, B, and C are shown in Figure 1. The spectra corresponding to the remaining sample types are shown in Figure 4. Several computer runs were made to search for the wavelength combination that gave the best discrimination capability among these various sample types. Several wavelength combinations were found with approximately equivalent classification capabilities. For the purpose of the current report, the results reported were obtained by using a subset of the wavelengths corresponding to the filters in the filter-based instrument. This allows us to investigate the possibility of transferring the results from the spectrophotometer to the filter-based unit.

The calibration consists of two matrices: the groupmean matrix and the inverse pooled covariance matrix. By use of this calibration, the Mahalanobis distances between the various groups can be calculated for the dataset from which the calibration was generated; these distances are displayed in Table I. Of particular interest are the distances between the closest pairs of groups. Of the materials investigated, the most similar (i.e., the ones lying closest together in multidimensional space) are materials I and L, at a distance of 13, and materials N and R, at a distance of 21. These materials are very close, especially compared to the distances between all the other group pairs. Comparing the spectra of these materials with each other, as in Figure 5 where the spectra are redisplayed for easier comparison, reveals that the spectra are similar not only at the six wavelengths selected for discrimination, but indeed, look like two samples of the same compound with only physical (e.g., particle size) differences between them.
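Finding the closest pairs of groups from a calibration, as done above for I and L and for N and R, can be sketched as follows. This Python/NumPy sketch is illustrative only; the function name and the `warn_below` parameter are hypothetical, with the default of 6 taken from the theoretical overlap limit given in the Theory section rather than from any threshold the authors report using.

```python
from itertools import combinations
import numpy as np

def closest_group_pairs(groups, means, M, warn_below=6.0):
    """List (group, group, distance) for pairs of groupmeans closer than
    `warn_below` Mahalanobis distances, nearest pair first."""
    means = np.asarray(means, dtype=float)
    pairs = []
    for i, j in combinations(range(len(groups)), 2):
        d = means[i] - means[j]
        dist = float(np.sqrt(d @ M @ d))
        if dist < warn_below:
            pairs.append((groups[i], groups[j], dist))
    return sorted(pairs, key=lambda t: t[2])
```

Raising `warn_below` lets the same function report pairs that are close relative to the rest of the table, not only those inside the theoretical overlap limit.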
Table I. Mahalanobis Distances between Groups in Calibration Set

       A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q
B    129
C    162  274
D    122   81  252
E     55   88  196   83
F     65  177  123  147  106
G    295  290  341  225  261  294
H    124  127  236   51  102  129  212
I     99   92  237   36   74  124  240   43
J    126   64  243  112   89  170  297  153  125
K     39  125  150  117   55   66  286  123  104  107
L     94  101  232   48   74  117  242   45   13  133  102
M    106  166  186  146   92  133  240  142  135  165  112  132
N     61   99  200   86   65   88  293   95   69  113   66   69  134
O     60  147  142  156   91   81  331  162  141  124   47  139  137   86
P     46  118  176   89   52   70  264   87   67  120   47   63  116   59   88
Q    158  124  277   53  130  169  217   48   62  158  156   69  182  122  194  120
R     49  104  184   89   56   75  285   94   71  108   47   70  127   21   74   42  123
Table II. Mahalanobis Distances from the “Unknowns” to the “Known” Groups for the Diffraction-Grating-Based Instrument
known groups
unknown
group
A
B
C
D
50 51 52 53 54A 54B 64 66G 66H 68 69I 79 82K 83
166 38 130 118 101 95 53 97 56 30 50 95 52 107
278 125 1 170 101 84 102 101 99 148 102 101 98 81
4 150 275 192 237 239 187 233 196 147 186 233 196 252
256 117 83 150 42 41 90 44 88 149 90 47 78 37
E
F
200 127 54 66 89 178 99 140 78 124 70 126 59 78 74 120 61 84 74 74 56 76 75 119 52 83 79 138 -
G
H
344 241 286 123 291 129 240 147 235 39 250 56 289 96 40 236 290 94 314 152 95 287 45 242 83 277 250 - 57
Table II displays the Mahalanobis distances for some of the “unknowns” to the various groups in multidimensional space. The distances for only one of the readings of each sample are presented; the values for the second reading of each sample were essentially the same. Of the 40 “unknowns” we show the results for only a small number. The first four samples presented in Table II are representative of the usual sample: the Mahalanobis distance from the “unknown” to one group is very small, and to the remaining groups very large, making the assignment very clear-cut. However, the samples 54A, 54B, 64, 66G, 66H, 68, 69I, 79, 82K, and 83 all require further comment, because, except for sample 68 (which requires separate comment), these samples were all assigned to one of the group pairs (I and L or N and R) that were very close to each other in the calibration dataset. Indeed, each of these unknowns is close to both members of the pair, and far from all other groups, indicating that these samples are one or the other of these apparently identical materials. In order to determine the final classification of these materials into one or the other of these two groups, a closer examination consisted of plotting the spectra and comparing the entire spectra to the spectra of the calibration samples. Of particular note are samples 54B and 83, which are not only rather far from both of the groups that they are closest to (I and L) but are also approximately equidistant from the two groups, making a final assignment difficult. The spectra of all the other questionable samples are essentially identical with the known groups to which each sample was assigned;
I
J
241 247 103 107 94 63 139 171 11 134 13 119 72 107 11 132 67 112 128 137 71 107 12 133 55 111 19 122 -
K
L
154 236 1 101 126 103 120 137 10 108 17 101 72 51 7 103 61 66 49 124 48 70 1 103 52 58 26 114 -
M
N
O
P
190 112 167 8 131 136 130 129 131 108 128 133 126 144
204 66 100 140 75 65 16 72 6 84 20 70 28 74 -
146 48 147 144 146 137 75 142 83 49 74 141 89 148
180 46 119 123 70 66 48 65 53 74 44 64 34 80
Q
R
S
282 189 203 47 212 155 126 105 285 186 134 205 76 212 63 68 226 71 6 208 124 72 211 66 16 210 121 22 188 74 2 208 124 68 71 214 111 21 214 65 232 78 --
Table III. Final Classification of Unknowns
unknown
50 51
classified unknown classified unknown classified C
59D
A
60F 61 62 63 64 65 66C 66H 67 68 691 69J
60E
H
70
52 53 54A 54B 53 56 57 58 59c
K B M L
-
E 0
J G
P
S
71
Q
72 73 74
D
F
R K L N G A R
J E
5 C
M
--
Q
$3
H
76
P
-3
78
D F
79
1
I ,
80 81 82K 85L 83
B 0
N 1
-
however, the spectra of samples 54B and 83, displayed in Figure 6, show appreciable differences from any of the known spectra. This, coupled with the difficulty of making the assignment on the basis of the statistics, caused us to not assign these samples. The final classifications arrived at are shown in Table III. Subsequent communication with the supplier of the samples confirmed our assignments; we had matched the “unknowns” with the correct “knowns” for all samples except sample no. 79, which was material I rather than material L. Samples 54B and 83, which we had not assigned, were “ringers”, with chemical structures very similar to the knowns (I and L), added to the
Figure 4. Spectra of the 16 samples which, together with the samples whose spectra are shown in Figure 1, constitute the training set. [Horizontal axis: wavelength (nm), 1200 to 2400.]
set to test our ability to reject samples not in the calibration training set. The supplier of the samples also confirmed that samples I and L are chemically the same, as are samples N and R. Thus purely physical differences may not be sufficient to allow discrimination between samples of the same composition in all cases.

Sample 68, as mentioned above, requires separate comment. Although, as Table II indicates, this sample is closest (most similar) to the “known” sample type A and, indeed, was correctly classified as A, the actual distance to sample type A (Mahalanobis distance of 30) is very large. Theoretically, considering that Mahalanobis distance represents a measure of standard deviation, we should expect essentially all the samples to lie within three times the Mahalanobis distance
of their respective groupmeans. In practice, this is not the case, due to small perturbations of the data from the ideal situation for which the theory holds true. Small variations in the size, shape, or orientation between the different groups, small changes in the physical nature of the samples, small drifts in the instrument, etc. could all cause samples to appear to lie beyond the three standard deviation limit. Our experience indicates that, because of these factors, a more reasonable cutoff point lies at a Mahalanobis distance between 10 and 15. For all samples other than no. 68 the distances between the unknown and the known samples were essentially the same regardless of the wavelength combination used to classify the samples. For sample no. 68, however, this particular combination of wavelengths gave a distance that
Table IV. Mahalanobis Distances from the “Unknowns” to the “Known” Groups Using the Interference-Filter-Based Instrument
known groups
unknown group
A
50 51 52 53 54A 60F 64 66G 66H 69I 79 82K 82L
110 35 40 118 80 253 101 77 109 99 76 77 81
,700
B
C
D
123 4 21 102 1 119 128 155 69 157 253 204 87 182 67 154 96 190 85 179 68 156 67 160 71 158 -
E
F
199 126 117 36 108 26 182 108 49 70 247 255 54 97 52 68 58 105 95 54 70 53 77 63 72 49 -
G
H
97 260 64 253 75 254 150 240 79 230 205 236 97 265 76 230 104 271 95 264 75 237 76 259 232 78 -
I
6oo
E
,400
,000
Q
R
S
101 215 182 184 94 224 25 145 41 140 84 230 136 215 176 230 80 42 216 72 23 240 238 254 3 235 98 79 69 82 43 216 14 237 108 82 1 233 95 80 39 220 68 84 21 228 74 91 216 80 40 73 -
189 97 102 32 92 42 179 139 49 102 256 251 8 115 51 99 5 123 10 113 99 46 30 94 48 103 -
,3004
:
’
1200
I
1400
’
I 1600
’
I 1000
’
:
2000
’
:
2200
’
I
2400
I
’
:
1400
’
:
1600
’
I
1000
’
:
2000
’
I
2200
’
~
1
2400
WAVELENGTH lnml
Unknown samples 83 and 548 have spectra that are identical except for an offset similar to that exhibited in Figure 1, although somewhat larger. Figure 6.
T
,500 ,400 3004
0
0 0 0 o L : 1200
lnml
.7OOt
0
P
,500
WAVELENGTH
0
O
t
0.oooJ
y
157 153 71 126 65 121 148 7 10 141 241 250 43 173 8 140 53 177 41 171 3 144 27 156 8 143 -
N
I
,300f
B
M
‘Oool
t
,500
104 2 21 132 76 245 98 73 107 95 74 75 77
L
,700
I
y
K
161 221 77 124 69 109 154 203 9 129 237 333 40 124 10 120 50 129 38 124 12 129 30 124 9 131 -
188 112 105 179 43 230 49 46 53 48 48 56 43
t
,
J
0
~
1200
:
’
:
1400
’
:
1600
’
:
1000
‘
I
2000
‘
:
2200
’
:
I
2400
WAVELENGTH lnml
Figure 5. Knowns that have almost identical spectra: (A) samples I and L differ only between 2200 and 2400 nm; (B) samples N and R
show only minor differences between
2200
and
2500
nm.
was much larger than other wavelength sets. Indeed, the actual reported classifications were based on the combined consideration of several of the different wavelength sets; from the values listed in Table I1 alone, we would have to conclude that we could not assign sample 68 to any of the known groups, since it is too far from all of them to be classified as any of them. There is further discussion of sample 68 below. T E S T OF CALIBRATION TRANSFERABILITY
Table IV presents the results of classifying the same set of unknown samples using a filter-based instrument. The calibration was transferred from the monochromator-based instrument to the filter-based instrument by correcting the group-mean matrix to the filter-based instrument as described in the “theory” section. Two repacks of each sample were used in making the correction. For the majority of samples the classification is again clear-cut; we thus present only a few of the results. Those samples belonging to one of the group pairs (I and L or N and R) that we now know represent the same materials under different physical conditions (samples 54A, 64, 66G, 66H, 69I, 79, 82K, 82L) show the same behavior as previously, i.e., the unknown sample lies equally or acceptably close to both of the known materials. An exception here is sample 82K, which, although actually material N, lies sufficiently far away from all groups that it would be rejected as a sample that had not been previously seen by the algorithm. Samples 60F and 68 also lie so far from their correct (and all other) groups that they would be rejected as samples not previously seen by the algorithm.

For sample 60F we note that the correct “known” group, group S, has a much higher absorbance level than the other groups. Higher absorbances have been seen to go hand in hand with larger particle sizes and greater variability of the readings (17). This would tend to cause the measured Mahalanobis distance to be larger than the theoretical cutoff of three for those samples that are valid representatives of the group. For this case, normalization of the Mahalanobis distance, as discussed above, would be appropriate.

Sample 68, on the other hand, does not have absorbances that are noticeably higher than the majority of the samples of the set. In this case we may conclude that reading a sample twice is not sufficient to accurately determine the group mean for that material. Alternatively, since this same sample performed anomalously on the monochromator-based instrument, we can suspect that this unknown is actually in a different physical state than the knowns, although it is supposed to be the same. This conclusion emphasizes the requirement that training set samples used for any chemometric
method actually cover the range of variability of samples that the method is to be applied to. It is noteworthy that, while the algorithm accurately classified most of the samples, even in those cases where it failed it “failed safe”, in that none of the samples was classified incorrectly. At worst a few samples would not have been classified at all, rather than classified incorrectly. We see, however, that the unknowns 54B and 83 also come as close to materials I and L as the other samples that actually were that material. We have seen before, with the monochromator data, that the unknown material labeled 54B and 83 has a spectrum almost identical with that of the known and, indeed, was included in the set of unknowns to act as a “ringer”. In the case of the monochromator data the differences between the spectra were sufficient to distinguish these samples from the actual “knowns”; in the case of the filter-based instrument, however, only a fortuitous coincidence of the filters available with the wavelengths where these small differences occur would allow distinguishing between such similar materials. We were not fortunate enough in this case for such a coincidence to happen; thus samples 54B and 83 would have erroneously been classified as material I or N had we had to rely only on the filter-based instrument’s data.
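The fail-safe behavior described above can be illustrated with a minimal Python sketch. It is not the program used in this study: the two-wavelength group means, the shared covariance matrix, the function names, and the cutoff of 15 (the upper end of the 10 to 15 range suggested by experience) are all assumptions chosen for illustration. A sample is assigned only to groups that lie within the cutoff, so an empty result means the sample is left unclassified and more than one candidate signals an uncertain classification.

```python
import math


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mahalanobis(x, mean, cov):
    """Return sqrt((x - mean)^T cov^-1 (x - mean)), a distance in standard-deviation units."""
    diff = [a - b for a, b in zip(x, mean)]
    weighted = solve(cov, diff)  # cov^-1 * diff without forming the inverse
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


def classify(x, group_means, cov, cutoff=15.0):
    """Return (candidate groups within the cutoff, nearest first; all distances)."""
    distances = {name: mahalanobis(x, mean, cov) for name, mean in group_means.items()}
    within = [name for name, d in distances.items() if d <= cutoff]
    return sorted(within, key=distances.get), distances


# Hypothetical absorbances at two wavelengths; not data from this study.
group_means = {"A": [0.50, 0.30], "B": [0.80, 0.45]}
cov = [[0.0004, 0.0], [0.0, 0.0001]]  # assumed covariance shared by all groups

print(classify([0.52, 0.31], group_means, cov)[0])  # ['A']: one group within the cutoff
print(classify([1.50, 0.90], group_means, cov)[0])  # []: far from every group, not classified
```

Returning every group inside the cutoff, rather than only the nearest one, keeps the two failure modes separate: no candidates means the sample resembles nothing in the training set, while several candidates means the known groups themselves are too similar to tell apart.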
CONCLUSIONS

The use of Mahalanobis distances is an accurate means of classifying samples as one of a number of materials whose near-infrared spectral properties have been measured. Additionally it identifies those cases where the classification is uncertain because of similarities between the spectra of several of the known, or “standard”, samples. It is fail-safe, in that anomalies in the data will cause a sample to not be classified, with the implied recommendation that such cases be verified using a different analytical technique. Where this method applies, we can take advantage of the speed and simplified sample handling that are characteristic of the near-infrared reflectance analysis technique (18) for qualitative as well as quantitative analysis.

Samples that have not been included in the training set are easily identified as such. This is evident from consideration of the data in Table II. If, for example, material C were not included among the known samples, then unknown sample 50 would have material F as its nearest match, at a Mahalanobis distance of 127. This is clearly beyond any reasonable bound for assigning a sample to any group, and so sample 50 would not be classified. A similar argument exists for the hypothetical deletion of any group from the training set.

These considerations indicate that discriminant analysis, like the other multivariate chemometric methods, requires alertness on the part of the analyst in order to detect the warnings the algorithm gives. Using Mahalanobis distances for classification allows computer programs to be written that contain a large measure of protection against marginal conditions, in a form that is easily understood and evaluated by the non-mathematician.
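The argument about deleting a group from the training set can be written out as a short Python sketch. Only the distance of 127 from sample 50 to material F is taken from the text; the distances to A and C, the cutoff of 15, and the function name `nearest_match` are hypothetical and serve only to make the example runnable.

```python
def nearest_match(distances, cutoff=15.0, exclude=()):
    """Return (group, distance) for the nearest known group.

    The group is reported as None when even the nearest group lies beyond
    the cutoff, i.e. the sample is left unclassified.
    """
    remaining = {g: d for g, d in distances.items() if g not in exclude}
    group = min(remaining, key=remaining.get)
    distance = remaining[group]
    return (group if distance <= cutoff else None), distance


# Distances from unknown sample 50 to three known groups.
# F = 127 is the value quoted in the text; A and C are placeholder values.
sample_50 = {"A": 140.0, "C": 2.0, "F": 127.0}

print(nearest_match(sample_50))                 # ('C', 2.0)
print(nearest_match(sample_50, exclude={"C"}))  # (None, 127.0)
```

With material C removed, the nearest remaining group is still reported along with its distance, so the analyst can see both that the sample was rejected and how far it was from the closest candidate.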
ACKNOWLEDGMENT

The authors wish to thank David Honigs for valuable discussions in the initial stages of this study.

LITERATURE CITED

(1) Massie, D. R.; Norris, K. H. Trans. Am. Soc. Agric. Eng. 1965, 8 (1), 598-600.
(2) Ben-Gera, I.; Norris, K. H. J. Food Sci. 1968, 33 (1), 64-67.
(3) Ben-Gera, I.; Norris, K. H. Isr. J. Agric. Res. 1968, 18 (3), 117-124.
(4) Rotolo, P. Baker’s Dig. 1978, 52 (5), 24-36.
(5) Watson, C. A. Anal. Chem. 1977, 49, 835A-840A.
(6) Wetzel, D. L. Anal. Chem. 1983, 55, 1165A-1176A.
(7) Hrushka, W. R.; Norris, K. H. Appl. Spectrosc. 1982, 36 (3), 261-265.
(8) Honigs, D. E.; Freelin, J. M.; Hieftje, G. M.; Hirschfeld, T. B. Appl. Spectrosc. 1983, 37 (6), 491-497.
(9) Rose, J. R. Second Annual Symposium on Near Infrared Reflectance Analysis; Technicon Instrument Corp.: Tarrytown, NY, 1982.
(10) Shenk, J. S.; Landa, I.; Hoover, M. R.; Westerhaus, M. O. Crop Sci. 1981, 21 (3), 355-358.
(11) Ciurczak, A. Seventh International Symposium on Near Infrared Reflectance Analysis; Technicon Instrument Corp.: Tarrytown, NY, 1984.
(12) Powell, L. A.; Hieftje, G. M. Anal. Chim. Acta 1978, 100, 313-327.
(13) Mahalanobis, P. C. Proc. Natl. Inst. Sci. India 1936, 2, 49-55.
(14) Gnanadesikan, R. “Methods for Statistical Data Analysis of Multivariate Observations”; Wiley: New York, 1977; Chapter 4.
(15) Kennedy, W. J.; Gentle, J. E. “Statistical Computing”; Marcel Dekker: New York, 1980; Chapter 7.
(16) “Specifying Wavelength Searches for the Combinations Program”; Infranote; Technicon Instrument Corp.: Tarrytown, NY.
(17) Norris, K. H.; Williams, P. C. Cereal Chem. 1984, 61, 158.
(18) Honigs, D. E. Ph.D. Thesis, University of Indiana, Bloomington, IN, 1984.
RECEIVED for review November 29, 1984. Accepted March 14, 1985.