Fuzzy Grid Encoded Independent Modeling for Class Analogies

Article pubs.acs.org/ac

Fuzzy Grid Encoded Independent Modeling for Class Analogies (FIMCA)

Peter de Boves Harrington*

Ohio University Center for Intelligent Chemical Instrumentation, Department of Chemistry and Biochemistry, Clippinger Laboratories, Ohio University, Athens, Ohio 45701-2979, United States

ABSTRACT: A novel representation of chemical measurements has been devised for which the data are encoded as fuzzy grids instead of the standard convention as a vector. The fuzzy grid encoded data and data in the standard format were evaluated with soft independent modeling for class analogies (SIMCA). The fuzzy version of SIMCA is referred to as FIMCA. These two methods were compared with simulated and real data to characterize the advantages of the fuzzy grid encoding. For complex data, the FIMCA approach often achieves better results, and for simpler data sets similar prediction results are obtained. The benefits of this approach are its simplicity, increase in rank of overdetermined data, and prevention of coincidental correlations with underdetermined data. This paper introduces the use of FIMCA as a method for untargeted (one-class classification) authentication of complex chemical profiles.

Chemometrics from its inception has relied on representing multivariate chemical measurement objects as vectors. A new representation is proposed that uses a fuzzy encoded grid to represent these multivariate objects. Figure 1 gives an example of the vector representation that is used in virtually all data analysis and compares it to the fuzzy grid encoding that is inherent to fuzzy optimal associative memories (FOAMs) (ref 1). This approach can be extended to tensors for which the order of the tensor would be increased by unity. For simplicity, these fuzzy grids can then be unfolded into a vector format, but usually these vectors will have a much larger dimensionality. However, personal computers with increased memory are readily available and can accommodate larger data objects. For overdetermined sets of data, an increase in the rank of the data may also be obtained. This idea of expanding data to higher dimensions has been termed the Copiosity Principle, which is the antithesis of the less-is-better approach that drives most research on feature selection and transformation methods to decrease the dimensionality. However, this method of grid encoding and its concomitant increase in the dimensionality results in losses of degrees of freedom with respect to the intensities of the data and can prevent overfitting that arises through coincidental correlations that occur frequently with underdetermined or megavariate data. In addition, the transformation to higher dimensionality may convert nonlinearly separable data objects to a space where they become linearly separable. This approach is used with kernels and support vector machines to overcome the limitation of nonlinearly separable data. The fuzzy grid encoding may be considered another kernel method (ref 2) with application to support vector machines (ref 3).

Soft independent modeling of class analogies (SIMCA) is an excellent approach to solving the one-class classification problem (ref 4). This problem is especially relevant with respect to society’s pressing need to authenticate foods and pharmaceuticals and has found widespread use in quality control for process analytical chemistry. There are some specialized cases, demonstrated later in this paper, for which SIMCA applied to vector representations of chemical data may not work well. This paper presents a procedure to transform analytical data to fuzzy encoded grids and then applies the same SIMCA procedure to the data in the new representation, which will henceforth be referred to as fuzzy independent modeling of class analogies (FIMCA). This paper will be focused on a carefully controlled comparison of SIMCA and FIMCA. Recent studies with FOAMs have shown that FOAMs may perform better than other conventional classification methods (ref 5). The fuzzy grid encoding procedure is the inherent first step for constructing a FOAM, so it was worth investigating this approach as a preprocessing method for SIMCA. The advantages of this new approach to encoding chemical measurements are that for overdetermined data the rank of the data may be increased and for underdetermined data overfitting may be avoided by limiting the range of the intensity values that might result in coincidental correlation with principal components. In addition, by selection of the fuzzy functions, measurement registration errors such as retention time drift may be accommodated. However, this topic is beyond the scope of the present paper.

© 2014 American Chemical Society

Received: January 13, 2014
Accepted: April 2, 2014
Published: April 2, 2014

dx.doi.org/10.1021/ac5001543 | Anal. Chem. 2014, 86, 4883−4892

Analytical Chemistry

Article

Figure 1. (Top left) Data for which each data point is represented as a component of a vector on the figure to the right. The figure below shows the same data represented as a gridded image.

FOAMs have proven useful for the authentication of natural products by chemical profiling. Because many applications of authentication require untargeted analysis, the ideal algorithms can perform modeling or one-class classification. The correct recognition of the target class is referred to as sensitivity, i.e., the proportion of correct recognitions for the target object. For a foreign object, one that is different from those used to define the target class, the model should reject it and return a null classification result. For this case, specificity, i.e., the proportion of correct recognition of a negative class, can be used. However, unity minus the specificity gives the false positive rate and for this presentation is more useful.

THEORY

For all comparisons in this paper a 95% confidence interval was used for modeling. All parameters were maintained the same for SIMCA and FIMCA. The only difference was the grid encoding step for data submitted to FIMCA; in fact it may simply be considered a data transformation step prior to SIMCA. A brief overview of SIMCA is presented. Principal component models or orthogonal bases are calculated from data sets of individual classes. In the studies presented here, the data are always mean centered prior to calculating the principal components. The m rows of data matrix $X$ are objects, and the n columns are measurements or variables. Singular value decomposition (SVD) is used to obtain the principal component matrix $V$, and by squaring the singular values that comprise the diagonal matrix $S$, the eigenvalues $\Lambda$ may be obtained. The matrix of eigenvalues is divided by m − 1 to convert the eigenvalues to variances, which is important if eq 4 will be used to calculate the upper confidence limit. The SVD of the training data set $X_L$ yields

$$X_L = \bar{x}_L + U_L S_L V_L^{\mathrm{T}} + E_L \tag{1}$$

for which $\bar{x}_L$ is the average of the training objects $X_L$, $S_L$ is a p × p diagonal matrix of singular values, and $V_L$ is an n × p matrix of principal components. The m × p matrix $U_L$ of normalized scores can be discarded if eq 2 is used. Otherwise, $U_L S_L$ can be used to obtain the projections of the training data set L on the basis $V_L$. The reconstruction error is $E_L$. The reconstructed data are obtained for model L by

$$\hat{x}_{i,L} = (x_i - \bar{x}_L) V_L V_L^{\mathrm{T}} + \bar{x}_L \tag{2}$$

for which $\hat{x}_{i,L}$ is the reconstructed data object using the p components of $V_L$. Typically, Hotelling’s $T^2$ statistics can be calculated from the scores of the objects projected onto the principal components, but for this paper, this statistic was not used. The Q statistic is another independent measure of the residual error or the variance excluded from the principal component model. This report will use the Q statistic exclusively for the different evaluations. The $Q_L$ statistic for model L is calculated as

$$Q_L = \sum_{j=1}^{n} (x_{i,j} - \hat{x}_{i,j,L})^2 \tag{3}$$

for which n is the number of calibration variables, $x_{i,j}$ is the intensity of object i and variable j, and $\hat{x}_{i,j,L}$ is the corresponding reconstructed value from the principal component model $V_L$ with p components. There are a couple of approaches for establishing confidence limits for Q. Although the calculation of Q is essentially the same, multiple criteria have been published for establishing the confidence intervals. Perhaps the simplest criterion uses an F critical value of $F_{n-p,\,(n-p)(m-p-1);\,0.95}$ (ref 6). For underdetermined systems, it is perhaps best to replace n with the rank of the calibration data. A more robust, albeit complex, critical value is based on the normal distribution and was given by Jackson and Mudholkar for process control (ref 7). This criterion is exclusively used in this work and has been reported by other researchers (ref 8). The calculations are given below

$$Q_\alpha = \theta_1 \left[ \frac{z_\alpha \sqrt{2\theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right]^{1/h_0} \tag{4}$$

$$\theta_k = \sum_{i=p+1}^{\min(m,n)} \lambda_i^k \tag{5}$$

$$h_0 = 1 - \frac{2\theta_1 \theta_3}{3\theta_2^2} \tag{6}$$

for which $z_\alpha$ is the standard normal deviate corresponding to the upper (1 − α) percentile and the $\theta_k$’s are the sums of the eigenvalues or variances raised to the kth power for the principal components that have been excluded from the SIMCA model. SVD is also used for the principal component analysis (PCA). The principal scores of the entire sets of data are used to efficiently display the distribution of the objects of a data set in a two-dimensional figure.

The procedure for converting data to a fuzzy grid is quite simple. One may think of it as printing a spectrum or chromatogram on a sheet of graph paper. The grids of the graph paper that contain the analytical signal (e.g., ink from the printer) are encoded with a value of unity, and all the empty

grids are encoded as zero. For two-way data objects (e.g., retention time × wavelength), the data are unfolded into a vector, and the same gridding procedure is used. Unfolding is simply concatenating the columns of a matrix to form a vector.

Figure 2. (Top left) A segment of a grid encoded chromatogram. (Top right) The same chromatogram after fuzzy convolution with a horizontal triangular membership function. (Bottom left) The grid encoded chromatogram after fuzzy convolution with a vertical membership function. (Bottom right) The grid encoded chromatogram after fuzzy convolution with a two-dimensional triangular membership function.

The grid procedure is now described in detail, and pseudocode is given in the Appendix. A set of data objects that are encoded as vectors is used to define the grid. The gridding procedure is applied to the training or calibration data. Usually, one would like to subtract the average of this set of objects first, which helps improve the efficiency of the gridding procedure. The number of grids is defined and in this paper is only defined for the ordinate (i.e., intensity data). Gridding along the abscissa is equivalent to binning, and many binning routines already exist to accomplish this task. So usually, we will define each variable or element of the data vectors as a horizontal or abscissa grid. The maximum and minimum values of these data will define the range of the grid, and dividing the range by the number of grids minus unity will yield the grid increment Inc. The grid is then further extended by the number of points for the fuzzy function with respect to both the minimum and maximum points. This step is important because it simplifies the convolution step with the fuzzy function and allows prediction values outside the grid to be detected if the fuzzy functions overlap between a training object and the prediction object. In this report, a triangular fuzzy function is used and works best for many cases. The triangular membership function is created by selecting an odd natural number N of points (i.e., a positive odd integer). The fuzzy function ff is defined as

$$ff(i) = 1 - \frac{|i|}{\lceil N/2 \rceil}, \qquad i = -\lfloor N/2 \rfloor, \ldots, \lfloor N/2 \rfloor \tag{7}$$

for which the fuzzy function ff(i) is defined for an interval between positive and negative values of the largest integer less than half N. The indexes i are integers that are symmetric about zero that will produce N points for the fuzzy function. The N in the denominator is also divided by 2, but this time the ceiling function is used, which finds the least integer that is larger than half of N. An object $x_i$ can be converted to a grid G as follows

$$g(i,j,k) = \begin{cases} 1, & k = \left\lfloor \dfrac{x(i,j) - \min(X)}{\mathrm{Inc}} \right\rfloor \\[2ex] 0, & \forall\, k \neq \left\lfloor \dfrac{x(i,j) - \min(X)}{\mathrm{Inc}} \right\rfloor \end{cases} \tag{8}$$

for which the intrinsic order of the object will increase by unity. For example, a spectrum which is a vector will be represented as a matrix after grid encoding. In eq 8, the data matrix X has been inflated to a tensor. The grid is defined by the range of values contained in the data matrix X, and as mentioned earlier it is a good idea to expand this range to accommodate convolution with the fuzzy function ff. The increment Inc is the size of each grid with respect to values of the data and is defined as

$$\mathrm{Inc} = \frac{\max(X) - \min(X)}{N_{\mathrm{grid}} - 1} \tag{9}$$

for which $N_{\mathrm{grid}}$ is the number of grids used to encode the ordinate. Next, a matrix G is defined for which the columns correspond to the original abscissa of our data (i.e., the variables), and the rows correspond to the intensity grids. Each intensity value from the data object should fall in one of the grids, and that grid will have a value of unity. Figure 2 gives an example of a gas chromatogram segment and the results of the gridding procedure using the standard number of 100 grids, so each grid spans 1% of the maximum range of the entire chromatogram. The top left panel is a grid encoded chromatogram segment before the fuzzification step (i.e., fuzzy logic term for applying the

triangular membership function). The top right panel is the result of applying a triangular fuzzy function with respect to the abscissa, and the bottom left panel is the result of applying the same function with respect to the ordinate or the grid direction. The bottom right panel is the result after the application of a two-dimensional triangular membership function. It might be tempting to use a standard convolution of the fuzzy function with the grid data, but this presents a problem when the fuzzy function points overlap, because the values will be summed and could violate the laws of fuzzy logic that limit the membership functions to a maximum of unity. It is better to use a logical function that will replace an overlapped fuzzy value with the larger fuzzy value. Fuzzy convolution of the unfolded grid is given by

$$g(i,j)_F = \max\left\{ g(i,j+k) \wedge ff(k) : k = -\lfloor N/2 \rfloor, \ldots, \lfloor N/2 \rfloor \right\} \tag{10}$$

for which the fuzzy grid $g(i,j)_F$ for object i and element j is the maximum of the minimum values obtained between the triangular fuzzy function ff and the unfolded binary encoded grid object. The computational load can be greatly reduced by removing grid values that are zero for all the training objects. Once the grid matrices have been unfolded into vectors, the columns of the unfolded grid encoded matrix that are all zeros can be culled, thereby compressing the data matrix. A simple index of the positions of the columns with values greater than zero is maintained. This approach is also much faster than using sparse matrices. The output of the gridding procedure is a matrix for which the rows are the training objects and the columns are the unfolded grids. These unfolded fuzzy grids are now in a matrix format, and conventional chemometric methods may be applied. Note that there are two mean-centering steps. The first occurs before the grid encoding. The second mean-centering occurs before the singular value decomposition step of the unfolded gridded data.

For prediction, the grid information from the training set is saved. The mean of the training data is subtracted from each prediction object, and then the grid from the training set is applied. If the prediction object has intensities that fall outside of the range of the grid, a bias is calculated that is equal to the sum of squares of the fuzzy function for every prediction object value that falls outside the extended range of the grid. This bias will later be added to the residual error that is characterized by the Q statistic. The fuzzy function is applied to the prediction object grid, and the grid matrix is unfolded. Then using the index that was stored for the training data, the grid will be reduced by removing the grids that do not correspond to this index. Values for grid points that are removed from the prediction object are squared and added to the bias. The reduced grid then has the second mean from the calibration set subtracted from it before reconstruction with the SIMCA model or orthogonal basis. The residual error $Q_L$ is obtained by using eq 3, i.e., the sum of squares of the differences between the reconstructed and prediction grid objects. As in eq 3, there is no divisor because $Q_L$ is calculated for each prediction object.

EXPERIMENTAL SECTION

All code was implemented in MathWorks (Natick, MA) MATLAB 2013b on a home-built Intel Core i7-3930K computer that operates at 3.20 GHz equipped with 64 GB of random access memory. The operating system was Microsoft (Redmond, WA) Windows 8 Enterprise 64 bit. All Windows updates were installed at the time they were issued.

Two synthetic data sets were constructed in MATLAB to compare SIMCA and FIMCA. The first set was a simple unimodal data set. A signal comprising four Gaussian peaks located at point numbers 40, 80, 120, and 160 in a 200 data point object, with amplitudes of unity and standard deviations of 5, was added to random normal deviates with a mean of zero and a standard deviation of 0.2 to furnish 100 positive objects with signal-to-noise ratios of 5. A second group of negative objects of the same size (i.e., 100 objects × 200 points) was constructed without adding the signal component (i.e., the four Gaussian peaks). Because the data are synthetic, the process was duplicated so as to prepare a second training set of analogous data (i.e., same signal, different noise) that was used to characterize performance, and the second set of negative data was merged with the first to yield 200 negative objects. The second data set was bimodal and designed to show the advantages of fuzzy grid encoding for complex data. For this case, the positive signal was constructed by adding random amplitudes in the interval of [0.5, 1.0] of two Gaussians at 40 and 80 or 120 and 160 point numbers so that 100 positive objects were furnished. Negative objects were formed by adding random amounts that varied between [0.5, 1.0] of the average of the positive signal to the random noise deviates. The random normal deviates had a standard deviation of 0.05, so this data set had a signal-to-noise ratio that ranged from 10 to 20.

Figure 3. (Top left) Synthetic positive data set of 100 objects at a SNR of 5. The negative data were similar noise distributed about a baseline of zero. (Top right) Principal component score plot of the positive (training) objects and the negative (prediction) objects. (Bottom left) Comparison of SIMCA and FIMCA with respect to the number of latent variables. Note, the classification rate is for a similar set of training data (T) and not the training data itself. The values labeled as training T are the sensitivities. The prediction rates are labeled as (P) and correspond to the false positive rate. (Bottom right) FIMCA using a 19-point triangular membership function applied to the grid encoded ordinate.

A reference data set (ref 9) of olive oil that comprised eight fatty acid measurements of 571 olive oils from nine regions was used as-is or modified by removing the olive oils that belonged to regions which had fewer than 50 objects (Calabria, Inland Sardinia, and Sicily). Because each class or region of olive oil was modeled individually, the lack of rank and degrees of freedom affected the evaluation. This data set was not autoscaled; however, when the data were autoscaled the performance for SIMCA improved. This result is not reported here.

A gas chromatographic/mass spectrometric study of 40 military grade jet fuels and 10 commercial grade jet fuels furnished a megavariate data set. The mass spectrometer was operated in electron ionization mode (70 eV). Five replicates were run for each sample by following an autosampler sequence generated by random block design. A solvent blank was run before and after each block to validate the lack of carryover with three cycles of syringe washes before and after each injection. All experimental data were collected on a Trace-GC 2000 gas chromatograph (GC) equipped with a Thermo Finnigan Polaris Q QIT-MS (Thermo Electron Corporation, San Francisco, CA, USA) as the detector. The gas chromatograph was also equipped with a TRIPLUS AS autosampler (Thermo Scientific). The Xcalibur software version 1.4 (Thermo Scientific) was used for the instrument control and data collection. The separation was accomplished with a 0.25 μm film of a polydimethyldiphenyl siloxane (5% phenyl) [DB-5, Agilent Technologies] wall coated open tubular column with a 30.0 m length and a 0.25 mm internal diameter. The initial temperature was 50 °C and held for 5 min, increased at a rate of 10 °C/min to 220 °C, and held for 5 min at 220 °C. A 1.8 min solvent delay was used under the split mode with a split ratio of 1:20. A flow rate of 1.5 mL/min of helium carrier gas was maintained by the flow controller. The retention times were binned from 1.9 to 27 min using a 0.01 min increment, and the mass measurements were binned from 60 to 424 Th using a 1.0 Th increment. Therefore, each object comprised 2510 retention time measurements and 3642 mass measurements so that the total number of variables or data points per object was 9 141 420. Each object was baseline corrected by orthogonal projection onto the best-fitting basis obtained from the blank runs. The retention times were aligned using a fourth order polynomial that maximized the correlation of the two-way GC/MS image with that of the mean GC/MS image. The two-way objects were unfolded and normalized to unit vector length.

The final data set was obtained by measuring mixed powders of Panax quinquefolius (i.e., a species of ginseng native to North America) that was grown in the U.S. or China. A 10 mL aliquot of 50% aqueous methanol solution was added to 200 mg of the powdered ginseng samples. The samples were sonicated and centrifuged, and then the supernatant was filtered. The supernatant was analyzed by direct infusion into an LCQ Classic ion-trap mass spectrometer (Thermo Fisher Scientific Inc., Waltham, MA). The spectra were collected with mass measurements at unit mass-to-charge ratios from 100 to 2000 Th. The spectra were normalized to unit vector length without any further preprocessing. A more detailed description of the experimental measurements has been reported (ref 5c). This work made a small deviation from the previous report in that the previous report used a square root transform of the mass spectra, and this work did not.
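As an illustration only, the unimodal synthetic data set described in this section, together with the triangular membership function (eq 7) and the ordinate grid encoding (eqs 8−10), might be sketched in Python as follows. This is not the MATLAB code used in this work. The function names, the fixed random seed, and the padding of ⌊N/2⌋ grids on each side of the range are assumptions, and mean-centering, culling of empty grids, and the out-of-range bias for prediction objects are omitted for brevity.

```python
import math
import random


def gaussian_peaks(n_points=200, centers=(40, 80, 120, 160), sd=5.0):
    """Noise-free signal: unit-amplitude Gaussian peaks at the given point numbers."""
    return [sum(math.exp(-0.5 * ((j - c) / sd) ** 2) for c in centers)
            for j in range(n_points)]


def make_objects(n_objects, signal, noise_sd, rng):
    """Add independent zero-mean normal deviates to the signal for each object."""
    return [[s + rng.gauss(0.0, noise_sd) for s in signal]
            for _ in range(n_objects)]


def triangular_ff(n):
    """Triangular membership function with n points (n odd), following eq 7."""
    half = n // 2                # floor(N/2): largest index on either side of zero
    denom = math.ceil(n / 2)     # ceil(N/2): denominator of eq 7
    return {i: 1.0 - abs(i) / denom for i in range(-half, half + 1)}


def grid_increment(objects, n_grid):
    """Minimum of the training data and the grid increment Inc of eq 9."""
    x_min = min(min(obj) for obj in objects)
    x_max = max(max(obj) for obj in objects)
    return x_min, (x_max - x_min) / (n_grid - 1)


def fuzzy_grid(obj, x_min, inc, n_grid, ff):
    """Fuzzy grid for one object: rows are intensity grids, columns are variables."""
    pad = max(ff)                # assumed extension: floor(N/2) grids per side
    n_rows = n_grid + 2 * pad
    grid = [[0.0] * len(obj) for _ in range(n_rows)]
    for j, x in enumerate(obj):
        k = math.floor((x - x_min) / inc) + pad    # eq 8, shifted by the padding
        for i, membership in ff.items():
            r = k + i
            if 0 <= r < n_rows:
                # keep the larger membership rather than summing (eq 10)
                grid[r][j] = max(grid[r][j], membership)
    return grid


rng = random.Random(0)           # fixed seed so the sketch is repeatable
signal = gaussian_peaks()
positives = make_objects(100, signal, 0.2, rng)         # signal plus noise
negatives = make_objects(100, [0.0] * 200, 0.2, rng)    # noise only

ff = triangular_ff(19)
x_min, inc = grid_increment(positives, n_grid=100)
G = fuzzy_grid(positives[0], x_min, inc, n_grid=100, ff=ff)
print(len(G), len(G[0]))
```

With 100 intensity grids and a 19-point function, each of the 200 columns of `G` holds a single triangle that peaks at unity in the grid containing the measured intensity. Unfolding `G` column by column would give the vector that is passed on to SIMCA.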



RESULTS AND DISCUSSION

Two simple synthetic data sets were constructed as described in the Experimental Section. Negative objects were constructed without a signal and only contained the random deviates (i.e., noise). One set of 100 positive objects was used to construct the SIMCA and FIMCA models. The other set of positive objects was similar in that it was constructed in the same manner, and the only differences were from the random noise components. This second set was used to evaluate the models. Therefore, the models were evaluated with 200 negative and 100 positive prediction objects.

The top left corner of Figure 3 is a plot of one of the positive training sets of data. The top right corner of Figure 3 is a PCA score plot of the data. The bottom left corner is a comparison between SIMCA and FIMCA for the grid encoded data. The grid encoding is standard in that the intensity axis was divided into 100 grid elements, and the number of 200 variables was maintained. One can see a clear difference between the grid encoded FIMCA and vector encoded SIMCA classification performance. The lower half of Figure 3 comprises the classification rates. On the left is the classification rate for using grid encoding without a fuzzy function. The results on the right are using grid encoding with a 19 point triangular function. The training set measures the sensitivity, and the prediction rate measures the false positive rate (i.e., unity minus the specificity). SIMCA performs well with the prediction objects being rejected from the positive data model, and as one would expect a first order model (one component) works ideally. As components are added, the sensitivity for the similar training set decreases (see the bottom left corner of Figure 3). For FIMCA there is a modest false positive error at one component, but then it decreases to zero after the second component. However, the sensitivity (classification of the positive data) decreases with component number.

Figure 4. (Top left) Bimodal positive data at an SNR of 20 for which positive objects contain two peaks at positions 40 and 80 or 120 and 160. (Top right) PCA score plot of the raw data. Note that the data are linearly inseparable. (Bottom right) PCA score plot of the fuzzy grid encoded data using standard conditions. Note that the data are now linearly separable. (Bottom left) Comparison of FIMCA and SIMCA with respect to component number. The training data T is the sensitivity and the prediction data P is unity minus the specificity.

The bottom right figure gives a similar comparison between SIMCA and FIMCA, but this time a fuzzy encoded grid was used. The standard fuzzy membership function in this paper is a triangular function that increases linearly from 0.1 to 1.0 and then back to 0.1 with a 0.1 increment that yields 19 points with respect to the intensities (y axis) and not the data points (x axis). For this case, both FIMCA and SIMCA properly reject the negative data and obtain a zero false positive rate. However, FIMCA is less affected by the selection of the number of components with respect to maintaining sensitivity. The positive objects in these results were not used for training but are similar to the training objects. For SIMCA, as additional components are added beyond 2, the sensitivity decreases and the false negative rate increases. Note that the classification rate for the training data or sensitivity for SIMCA differs between the left and right plots because two similar training sets were used to evaluate the classification rates. FIMCA exhibits similar behavior but at greater component numbers, so one advantage is that the

using 100 bootstraps with 10 Latin partitions.10 Because a 95% confidence interval was used to define the models, a 95% recognition rate should be achieved for the training objects. These data were not autoscaled but used directly without any preprocessing. The prediction results were good but not great, so olive oils with fewer than 50 objects were culled from the data set (i.e., Calabria, Inland-Sardinia, and Sicily). The top left plot gives the PCA scores for this reduced data set, and the top right plot gives the PCA scores of the fuzzy grid encoded data. SIMCA and FIMCA were then compared for the classification of the remaining six regions in the bottom right plot. Once again, standard conditions of 100 intensity grids and a 19 point triangular fuzzy membership function were used. For both the full data and reduced data, the SIMCA performance was best for four and three components, respectively. However, for FIMCA selecting the correct number of components is not as important. Probably, there exist different numbers of optimal components for each region of olive oils, and this evaluation applied a onesize-fits-all approach for both SIMCA and FIMCA. A plausible explanation for the better performance of FIMCA is that it is less sensitive to the number of components used in the model. When the data were autoscaled, SIMCA and FIMCA had equivalent classification rates. However, as mentioned earlier, SIMCA’s maximum prediction rate occurred for a single component number, and the FIMCA equivalent prediction rate occurred across a larger range of component numbers. Another evaluation compared the algorithms with megavariate data obtained from GC/MS measurements of jet fuels. For this case, eight military grade jet fuels with five replicate measurements were used to build a single model, and the model was tested for recognizing two commercial grade jet fuels with five replicate measurements. 
For a comparison, the results were obtained from the full data and the compressed data using the principal component transform (PCT), which is a lossless compression method. No data are lost from the training set; however the prediction set is projected onto the principal components of the training set and suffers a loss of information during this step. Figure 7 gives the results for the comparisons of the uncompressed and PCT compressed data. Both SIMCA and FIMCA performed well for the uncompressed data. SIMCA also is much faster. For the compressed data, FIMCA achieved a zero false positive rate after 14 components were added to the model, while SIMCA had a higher false positive rate. The eigenstructure of the projected data is the same as that of the training data, while the nonlinear fuzzy grid encoding of the PCT data alters the eigenstructure. Although it is unnecessary to use the PCT for SIMCA alone, it is helpful for other chemometric methods. If SIMCA is part of a suite of modeling routines for data that has been compressed by the PCT, a loss of modeling performance could be observed. Fuzzy grid encoding of the PCT data before the SIMCA analysis may help prevent this loss of performance. The last evaluation used direct infusion mass spectrometry of a species of ginseng (Panax quinquefolius) that was grown in the U.S. and China. For this authentication example, U.S. ginseng powder was diluted with Chinese ginseng powder to furnish samples that contain 100, 90, 80, 50, and 0% U.S. ginseng. The model building data set comprised the 100 and 90% mixtures of U.S. ginseng, and the prediction sets contained the other more diluted mixtures of ginseng. Figure 8 gives the principal component scores that were obtained from the normalized mass spectra on the left. On the right side, one can see that both

prediction accuracy is not as sensitive as SIMCA to the selection of the number of components for the model. Optimization of the number of components may be difficult for complex problems, especially with multimodal data or data that vary with respect to time. In addition, the utility of fuzzy encoding of the grid images has been demonstrated.

The next synthetic data set was designed to be complex and bimodal so that SIMCA or any other vector based chemometric approach would have a difficult time modeling the data. The positive signals comprise two Gaussian peaks at point numbers of either 40 and 80 or 120 and 160 (Figure 4, top left). The negative signals are objects at different SNRs that contain all four peaks instead of only two at a time. The top right plot is the PCA score plot of the data objects, which are linearly inseparable. The bottom left plot is the PCA score plot of the fuzzy grid encoded objects, and one can see that the data are now linearly separable. The bottom right plot is a comparison between FIMCA and SIMCA. In this plot, the negative samples were clearly recognized and rejected from the model by FIMCA, while SIMCA could not discern the differences between the positive and negative prediction samples. FIMCA also performed better than SIMCA with the set of similar positive samples. These positive samples were not used for model building but are similar to the set of positive objects that were used for training.

Another study used a reference data set of eight fatty acid measurements for 571 olive oils from different regions of Italy. For these data, the rank is plotted with respect to the intensity grid number in Figure 5. For overdetermined data, fuzzy grid encoding provides a method to increase the rank of the data set until it reaches the number of objects. For large collections of data with few measurement variables, this approach can add more degrees of freedom for modeling by multivariate methods. The olive oils were classified into one of nine regions using SIMCA and FIMCA (Figure 6, bottom left). A winner-take-all algorithm was used so that an object was assigned to the best fitting class. The fit was determined by dividing the Q statistic by the corresponding confidence limit of each model. For this data set, the training data were included in the model building. Average classification and confidence intervals were obtained by

Figure 5. Plot of rank versus intensity grid number for the reference olive oil data set. Note that the rank is limited by eight variables and the 571 objects.
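
The rank increase described for overdetermined data can be illustrated with a minimal sketch. It assumes a triangular membership function on a uniform intensity grid with per-variable min-max scaling. The function name `fuzzy_grid_encode`, the scaling, and the random stand-in data (200 objects by 8 variables, not the olive oil measurements) are illustrative assumptions and not necessarily the procedure used in this paper.

```python
import numpy as np


def fuzzy_grid_encode(X, n_levels):
    """Hypothetical encoder: triangular memberships on a uniform intensity grid.

    X has shape (n_objects, n_variables); the result is unfolded to
    (n_objects, n_variables * n_levels). Requires n_levels >= 2 and
    no constant columns (assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    scaled = (X - lo) / span                      # each variable mapped to [0, 1]
    centers = np.linspace(0.0, 1.0, n_levels)     # grid levels along the intensity axis
    spacing = centers[1] - centers[0]
    # Membership falls linearly from 1 at a grid level to 0 at its neighbors.
    membership = np.clip(
        1.0 - np.abs(scaled[:, :, None] - centers) / spacing, 0.0, 1.0
    )
    return membership.reshape(X.shape[0], -1)


rng = np.random.default_rng(0)
X = rng.random((200, 8))                          # stand-in for an overdetermined data set
rank_raw = np.linalg.matrix_rank(X)

ranks = {}
for n_levels in (2, 4, 8):
    grid = fuzzy_grid_encode(X, n_levels)
    ranks[n_levels] = np.linalg.matrix_rank(grid)
    print(n_levels, grid.shape[1], ranks[n_levels])
```

In this sketch the rank of the raw matrix is capped by its eight variables, whereas the rank of the encoded matrix grows with the number of grid levels and can never exceed the number of objects.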

dx.doi.org/10.1021/ac5001543 | Anal. Chem. 2014, 86, 4883−4892


Figure 6. (Top left) PCA score plot of the reduced Italian olive oil data set. (Top right) PCA score plot of the same data after fuzzy grid encoding. (Bottom left) Comparison of FIMCA and SIMCA using 100 bootstraps and 10 Latin partitions with the entire set of data. (Bottom right) A similar comparison for the reduced data, from which the three smallest groups had been removed. The dots represent the 95% confidence intervals.
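
The winner-take-all assignment used for the olive oil comparison, in which each object goes to the class whose model gives the smallest ratio of Q statistic to confidence limit, can be sketched as follows. This is a minimal illustration under stated assumptions: the empirical 95th percentile of the training Q values stands in for the confidence limit, the two synthetic classes are invented for the example, and the names `q_statistic`, `fit_class_model`, and `winner_take_all` are hypothetical.

```python
import numpy as np


def q_statistic(X, mean, loadings):
    """Sum of squared residuals after projection onto the class PCA model."""
    centered = X - mean
    residual = centered - centered @ loadings @ loadings.T
    return np.sum(residual ** 2, axis=1)


def fit_class_model(X, n_components, percentile=95.0):
    """Fit one PCA class model; the percentile limit is an assumption of this sketch."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    loadings = Vt[:n_components].T
    q_limit = np.percentile(q_statistic(X, mean, loadings), percentile)
    return {"mean": mean, "loadings": loadings, "q_limit": q_limit}


def winner_take_all(X, models):
    """Assign each object to the class with the smallest Q / limit ratio."""
    names = np.array(list(models))
    ratios = np.column_stack([
        q_statistic(X, m["mean"], m["loadings"]) / m["q_limit"]
        for m in models.values()
    ])
    return names[np.argmin(ratios, axis=1)]


rng = np.random.default_rng(1)


def make_class(center, n):
    """Synthetic class: large variance in two variables, small noise elsewhere."""
    X = rng.normal(0.0, 0.1, (n, 5))
    X[:, :2] += rng.normal(0.0, 3.0, (n, 2))
    return X + center


center_a = np.zeros(5)
center_b = np.array([0.0, 0.0, 0.0, 0.0, 10.0])
models = {
    "A": fit_class_model(make_class(center_a, 50), n_components=2),
    "B": fit_class_model(make_class(center_b, 50), n_components=2),
}
X_new = np.vstack([make_class(center_a, 10), make_class(center_b, 10)])
predicted = winner_take_all(X_new, models)
print(predicted)
```

Dividing by each model's own limit puts the residuals of different class models on a common scale, so that a class with inherently larger residual variance is not penalized in the comparison.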

Figure 7. Comparison of SIMCA and FIMCA for the classification of military grade (T) and commercial grade (P) jet fuels from (left) two-way GC/MS data objects that comprise over 9 million variables and (right) principal component transformed two-way GC/MS data objects that had 50 variables.
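
The asymmetry of the PCT, lossless for the training set but lossy for projected prediction objects, can be checked numerically. The sketch below is an assumption-laden illustration: it uses an uncentered singular value decomposition and random stand-in data, and the helper name `pct_fit` is hypothetical; the paper's exact PCT implementation may differ.

```python
import numpy as np


def pct_fit(X_train, tol=1e-10):
    """Return loadings spanning the row space of the training set."""
    _, s, Vt = np.linalg.svd(X_train, full_matrices=False)
    keep = s > tol * s[0]          # drop numerically null components
    return Vt[keep].T              # shape (n_variables, rank)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 500))   # underdetermined: 20 objects, 500 variables
X_pred = rng.normal(size=(5, 500))

V = pct_fit(X_train)
T_train = X_train @ V                  # compressed training scores, 20 columns
T_pred = X_pred @ V                    # prediction objects projected onto the same axes

# Reconstruction error: essentially zero for training, large for prediction objects.
train_error = np.linalg.norm(X_train - T_train @ V.T)
pred_error = np.linalg.norm(X_pred - T_pred @ V.T)
print(V.shape, train_error, pred_error)
```

The training objects lie entirely in the space spanned by their own principal components, so they are recovered exactly, while any part of a prediction object outside that space is discarded by the projection.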


Figure 8. (Left) Principal component scores for the training and prediction data. (Right) Classification rates for the 100−90% blended ginseng samples (T) (sensitivity) and the 80−0% blended ginseng samples (P) (false positive).
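
The two rates reported for this kind of one-class authentication can be computed from the Q-to-limit ratios of the target and nontarget objects. The sketch below assumes that an object is accepted when its ratio does not exceed 1; the function name `acceptance_rates` and the numerical ratios are made-up illustrations, not values from the ginseng study.

```python
import numpy as np


def acceptance_rates(q_ratio_target, q_ratio_nontarget, threshold=1.0):
    """Return (sensitivity, false positive rate) for a one-class model.

    An object is accepted by the model when its Q / limit ratio is at or
    below the threshold (an assumption of this sketch).
    """
    accepted_target = np.asarray(q_ratio_target) <= threshold
    accepted_nontarget = np.asarray(q_ratio_nontarget) <= threshold
    return accepted_target.mean(), accepted_nontarget.mean()


# Made-up ratios: four target-class objects and four adulterated objects.
sensitivity, false_positive_rate = acceptance_rates(
    [0.2, 0.8, 1.3, 0.5], [0.9, 2.0, 3.1, 1.5]
)
print(sensitivity, false_positive_rate)
```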

modeling methods perform effectively. However, with fuzzy grid encoding and the resulting increase in the number of variables, the model can reject the adulterated data with fewer components and is more efficient at maintaining a higher sensitivity for the training objects and a higher specificity, as exhibited by a lower false positive rate.

CONCLUSIONS

For cases that require modeling complex data, e.g., data that are multimodal or not linearly separable, transforming the data to a fuzzy grid representation may improve performance. The benefit for overdetermined data is that the rank may be increased by fuzzy grid encoding. Constraining the ordinate with fuzzy grid encoding may help prevent overfitting of underdetermined data. The disadvantage is that the size of the data is increased, which creates a heavier computational load. However, with advances in computer technology, specifically parallel processing and increasing memory capabilities, modern personal computers can accommodate this increased load. The fuzzy functions allow constraints to be relaxed with respect to the ordinate and abscissa of the grids. This first paper evaluated the effect of the fuzzy function with respect to the ordinate, i.e., the intensity of the data.

For the comparison of FIMCA and SIMCA, the use of the fuzzy grid encoding decreases the sensitivity with respect to selecting the number of components and relieves the burden of precise optimization of the component number for each model. In addition, in some cases, more efficient models may be obtained with fewer components. More efficacious models with better prediction results may be achieved for complex data that are multimodal and/or overdetermined by expanding the multivariate rank.

The conventional method of representing chemical measurements as vectors should be used for simple problems, but for complex pattern recognition problems that arise in proteomics, metabolomics, and authentication, fuzzy grid encoding may provide a new approach for representing the data and an alternative means for solving challenging problems through pattern recognition. This paper presented only a single fuzzy function in a single mode of application to a uniform grid. A diversity of fuzzy functions and grid configurations exists to provide a broad avenue for future investigations.

APPENDIX PSEUDOCODE

AUTHOR INFORMATION

Corresponding Author

*Tel.: +1-740-994-0265. E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

The Center for Intelligent Chemical Instrumentation is acknowledged for support of this work. Drs. James Harnly, Pei Chen, and Xiaobo Sun are thanked for supplying some of the data used for this paper. Dr. Xiaobo Sun, Dr. Zhengfang Wang, Mengliang Zhang, Xinyi Wang, and Ahmet Aloglu are acknowledged for their helpful comments and criticisms.


REFERENCES

(1) Wabuyele, B. W.; Harrington, P. D. Appl. Spectrosc. 1996, 50 (1), 35−42.
(2) Czekaj, T.; Wu, W.; Walczak, B. J. Chemom. 2005, 19 (5−7), 341−354.
(3) (a) Vapnik, C. C. V. Machine Learning 1995, 20 (3), 273−297. (b) Wang, G. Y.; Ma, M. Y.; Zhang, Z. Y.; Xiang, Y. H.; Harrington, P. D. Talanta 2013, 112, 136−142. (c) Xu, Z. F.; Bunker, C. E.; Harrington, P. D. Appl. Spectrosc. 2010, 64 (11), 1251−1258. (d) Yang, F.; Tian, J.; Xiang, Y. H.; Zhang, Z. Y.; Harrington, P. D. Cancer Epidemiol. 2012, 36 (3), 317−323. (e) Zhang, J. J.; Zhang, Z. Y.; Xiang, Y. H.; Dai, Y. M.; Harrington, P. D. Talanta 2011, 83 (5), 1401−1409.
(4) Wold, S. Pattern Recognit. 1976, 8 (3), 127−139.
(5) (a) Wang, Z. F. H. P. B. Feature selection of gas chromatography/mass spectrometry chemical profiles of basil plants using a bootstrapped fuzzy rule-building expert system. Anal. Bioanal. Chem. 2013, in press. (b) Wang, Z. F.; Chen, P.; Yu, L. L.; Harrington, P. D. Anal. Chem. 2013, 85 (5), 2945−2953. (c) Harnly, J. M. C. P.; Harrington, P. B. J. AOAC Int. 2013, in press.
(6) Branden, K. V.; Hubert, M. Chemom. Intell. Lab. Syst. 2005, 79 (1−2), 10−21.
(7) Jackson, J. E.; Mudholkar, G. S. Technometrics 1979, 21 (3), 341−349.
(8) (a) Wise, B. M.; Gallagher, N. B. J. Process Control 1996, 6 (6), 329−348. (b) Brereton, R. Chemometrics for Pattern Recognition; John Wiley & Sons: Chichester, U.K., 2009; p 522.
(9) (a) Hopke, P. K.; Massart, D. L. Chemom. Intell. Lab. Syst. 1993, 19 (1), 35−41. (b) Forina, M.; Armanino, C. Ann. Chim. (Rome, Italy) 1982, 72 (3−4), 127−141. (c) Forina, M.; Tiscornia, E. Ann. Chim. (Rome, Italy) 1982, 72 (3−4), 143−155.
(10) Harrington, P. D. B. TrAC, Trends Anal. Chem. 2006, 25 (11), 1112−1124.
