Factor Analysis of Mass Spectra from Partially Resolved Chromatographic Peaks Using Simulated Data

Hugh B. Woodruff
Merck Sharp & Dohme Research Laboratories, P.O. Box 2000, Rahway, New Jersey 07065

P. C. Tway and L. J. Cline Love*
Department of Chemistry, Seton Hall University, South Orange, New Jersey 07079
The use of factor analysis on mass spectroscopic data to determine the number of components under a chromatographic peak was critically evaluated on simulated Gaussian gas chromatographic-mass spectroscopic (GC-MS) results. In the absence of noise, except for computer roundoff errors, the limit of detection of an impurity under the main component peak was approximately 1%, depending on the resolution, the peak width, and the similarities of the fragmentation patterns. When noise was added to the data matrix, the limit of detection increased to 3-10%. Two-component mixtures, which could not be resolved visually, were resolved easily by factor analysis. Criteria for the evaluation of the eigenvalues, imbedded error, and indicator functions are proposed to eliminate some of the subjectiveness previously present in evaluating factor analysis results.
Although chromatographic methods, especially those using capillary columns, offer high selectivity and resolution, it is often difficult to separate completely two or more components in a complex mixture or to determine conclusively whether a peak is due to a single component. Better results can sometimes be obtained by changing the column, the solvent system, or the temperature, but the presence of a single peak in several types of chromatographic systems does not prove that the sample is a single component. Additional information about such possible multicomponent peaks may be obtained if the mass spectra at intervals across the peak are recorded and the data are examined by factor analysis.

Factor analysis is a mathematical technique for the interpretation of many types of data. It was originally used in the social sciences (1) and has more recently been used by scientists in the fields of chromatography (2), spectroscopy (3), biological activity (4), and mass spectrometry (5). Halket and Reed (6) applied factor analysis to the mass spectra of mixtures fractionated within the mass spectrometer, while Ritter et al. (7) used factor analysis to calculate the number of components in the mass spectra of gas mixtures obtained by distillation. Halket (8, 9) has applied both matrix rank analysis and principal component analysis to repetitively scanned gas chromatographic-mass spectroscopic (GC-MS) data to demonstrate the utility of these methods.

Previous work from this laboratory has shown the applicability of factor analysis to typical GC-MS and solid probe samples on a routine basis (10). A totally automated system for factor analysis of mass spectroscopic data was implemented, eliminating any intermediate batch processing and permitting a sample of questionable purity to be factor analyzed by a trained technician in less than 5 min.
Although the practical applicability of factor analysis to spectroscopic data has been shown, interpretation of the results to this point has been somewhat subjective. In order
to make factor analysis truly practical for routine use, it was necessary (i) to determine the limit of detection of a minor impurity under a chromatographic peak as a function of resolution and (ii) to develop objective criteria for the evaluation of factor analysis results. Both endeavors were successfully accomplished and are reported here.

In the present work a factor analysis technique known as principal component analysis was used (11) to obtain eigenvalue data from which the number of components in the mixture was determined. The eigenvalues are associated with a set of eigenvectors which span the multidimensional space occupied by the spectroscopic data. The magnitude of an eigenvalue indicates the amount of space spanned by the associated eigenvector. The object of the analysis is to determine the minimum number of eigenvectors required to reproduce the original data within experimental error. This determination is based on observing a significant drop in the magnitude of the eigenvalues; the number of large eigenvalues is an indication of the number of components in the mixture.

Malinowski has developed an error theory for factor analysis (12), which can be used as an aid to interpret results. This theory was developed to help overcome the subjectivity of the evaluation of the eigenvalues (EV). In this theory, three different functions are calculated which can be used to predict the number of factors spanning the data set. The real error (RE) is the difference between the pure, error-free data and the actual experimental data. If one has a good estimate of the experimental error, then by comparing the calculated RE values with this estimate, the correct number of significant eigenvectors can be determined. Theoretically, the imbedded error value (IE) should reach a minimum plateau when the correct number of factors is used. Additional factors are merely redundant and do not remove any further error.
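The error functions of this theory (RE, IE, and the indicator function IND) can be computed directly from the eigenvalues. The following is a minimal sketch, assuming the expressions commonly quoted from Malinowski's error theory: RE(n) = [sum of the discarded eigenvalues / (l(s - n))]^(1/2), IE(n) = RE(n)(n/s)^(1/2), and IND(n) = RE(n)/(s - n)^2, where s and l are the smaller and larger dimensions of the data matrix. These formulas and the function name `error_functions` are not given in this paper and are assumptions of the sketch; the eigenvalues in the usage lines are made up for illustration.

```python
import math

def error_functions(eigenvalues, n_rows, n_cols):
    """Return (RE, IE, IND) lists for n = 1 .. s-1 retained factors.

    eigenvalues: eigenvalues of the covariance matrix, largest first;
                 min(n_rows, n_cols) values are expected.
    Assumed formulas (Malinowski's error theory, not stated in the paper):
        RE(n)  = sqrt(sum_{j>n} ev_j / (l * (s - n)))
        IE(n)  = RE(n) * sqrt(n / s)
        IND(n) = RE(n) / (s - n)**2
    """
    small = min(n_rows, n_cols)   # s: number of eigenvalues
    large = max(n_rows, n_cols)   # l
    ev = sorted(eigenvalues, reverse=True)[:small]
    re_list, ie_list, ind_list = [], [], []
    for n in range(1, small):
        residual = sum(ev[n:])    # variance left in the discarded factors
        re_n = math.sqrt(residual / (large * (small - n)))
        re_list.append(re_n)
        ie_list.append(re_n * math.sqrt(n / small))
        ind_list.append(re_n / (small - n) ** 2)
    return re_list, ie_list, ind_list

# Hypothetical eigenvalues for a 10 x 4 data matrix with two real factors.
re_values, ie_values, ind_values = error_functions(
    [1000.0, 100.0, 1.0, 1.0], n_rows=10, n_cols=4)
best_n = ind_values.index(min(ind_values)) + 1   # factor count at the IND minimum
print(best_n)
```

In this made-up case IND is smallest at two factors, which is the behavior the theory relies on; with real data the minimum can be shallow, so RE remains useful as a cross-check.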
The indicator function (IND) has been found empirically to reach a minimum when the correct number of factors is employed. Since accurate use of RE is dependent on knowledge of the experimental error and such information is not always available, IE and IND are used to predict the number of factors and RE is used in a corroborative fashion. The three functions have been evaluated and guidelines have been set up for their use.

EXPERIMENTAL SECTION

The limit of detection and criteria for evaluation of the factor analysis results were determined by using simulated GC-MS data. Four different types of data were evaluated: (i) the progressively convergent overlap of two or more components into a single chromatographic peak as a function of resolution using pure Gaussian data with no noise; (ii) the overlap of two components of pure Gaussian data such that the impurity is at all times completely under the main component peak; (iii) the same as (i) except that random noise was added to all mass spectral data; (iv) the same as (ii) except that random noise was added to all mass spectral data.

© 1980 American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 53, NO. 1, JANUARY 1981
Table I. Eigenvalue Results for Two-Component Mixtures (a)

impurity   resolution 0.048              resolution 0.240              resolution 0.768
level      A (b)          B (c)          A (d)          B (e)          A (f)          B (g)
10%        2.9 × 10^7     2.5 × 10^7     2.8 × 10^7     2.4 × 10^7     2.6 × 10^7     2.1 × 10^7
           6.1 × 10^3     6.2 × 10^[?]   1.5 × 10^4     1.5 × 10^4     3.1 × 10^[?]   3.0 × 10^5
           6.6            3.9            10.4           7.5            13.5           9.3
           5.7            2.1            4.9            4.5            7.0            7.5
1%         2.6 × 10^7     2.2 × 10^7     2.6 × 10^7     2.1 × 10^7     2.6 × 10^7     2.1 × 10^7
           69.9           11.5           1.6 × 10^3     1.6 × 10^[?]   3.1 × 10^3     3.2 × 10^[?]
           8.1            3.6            8.1            3.2            15.3           7.4
           3.7            1.7            2.9            1.8            3.7            5.4
0.1%       2.5 × 10^7     2.1 × 10^7     2.5 × 10^7     2.1 × 10^7     2.5 × 10^7     2.1 × 10^7
           8.9            3.5            17.9           5.1            28.6           5.3
           6.4            2.4            7.5            4.7            11.3           3.9
           3.7            1.7            4.6            2.0            3.7            2.6

(a) GC peak for each component is of equal width and the peaks are partially separated. (b) Matrix size 13 × 22. (c) Matrix size 6 × 22. (d) Matrix size 13 × 27. (e) Matrix size 6 × 27. (f) Matrix size 13 × 37. (g) Matrix size 6 × 37.
The mass spectral data used in these simulations were obtained from the National Bureau of Standards Mass Spectral Library, which was included with the software of the INCOS Data System from the Finnigan Corp. The four compounds used in this study were the O-methyloxime of 3,17,21-tris(trimethylsilyloxy)pregnan-20-one and 3,17,20,21-tetrakis(trimethylsilyloxy)pregnan-11-one (series A) and the methyl ester of benzoic acid and the hydrazide of benzoic acid (series B). The two series A compounds have similar ions at the low mass end of their spectra but very different fragmentation patterns at the higher masses. The base peak and major ions in series B compounds are identical, but they have different relative intensities. The series A compounds were examined to determine the sensitivity of factor analysis to the presence of a small amount of a second component which is similar but does have a different fragmentation pattern. The series B compounds were studied to test the ability of factor analysis to resolve isomers or compounds with very similar fragmentation patterns that differ merely in peak intensities. The presence of two such components in a single peak is very difficult to determine by visual inspection of the individual mass spectra.

GC-MS peaks were simulated for each component by using a simple FORTRAN IV program. With the one exception described below, all the peaks were assumed to be Gaussian with a peak width of 4σ equal to 21 mass spectral scans. The assumption of a Gaussian peak shape is valid, especially for capillary gas chromatographic data. The main component was positioned with a peak maximum at scan 11. The concentration of the minor component was varied from 0.1% to 10%, and its peak maximum position was varied from scan 12 to scan 27. Thus, the resolution of the two components was varied from nearly complete overlap, 0.048, to a situation approaching complete resolution, 0.768.
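The peak model above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' FORTRAN IV program: it assumes a unit-height Gaussian with 4σ = 21 scans, and it takes resolution as the difference between the peak maxima divided by the average peak width. The function names `gaussian_profile` and `resolution` are hypothetical.

```python
import math

def gaussian_profile(center, width_4sigma, n_scans):
    """Unit-height Gaussian elution profile sampled at scans 1..n_scans."""
    sigma = width_4sigma / 4.0
    return [math.exp(-0.5 * ((scan - center) / sigma) ** 2)
            for scan in range(1, n_scans + 1)]

def resolution(max_a, max_b, width_a, width_b):
    """Difference between peak maxima divided by the average peak width."""
    return abs(max_b - max_a) / ((width_a + width_b) / 2.0)

# Main component at scan 11, a 1% impurity one scan later, both 21 scans wide.
main = gaussian_profile(center=11, width_4sigma=21, n_scans=22)
impurity = [0.01 * h for h in gaussian_profile(center=12, width_4sigma=21, n_scans=22)]

r = resolution(11, 12, 21, 21)   # 1/21, the near-complete-overlap case
print(round(r, 3))
```

With a one-scan offset and equal 21-scan widths the sketch gives 1/21, which rounds to the 0.048 quoted for nearly complete overlap.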
Resolution between two peaks is defined as the difference between the peak maxima divided by the average peak width. Electron impact mass spectra are well-known to be linearly additive at low ion source pressures, so the mass spectrum for each scan may be calculated by simply summing the relative contributions of all components.

The typical noise in GC-MS data was evaluated by running the same sample 10 times and examining the variation in peak intensity of the ions. By normalization of the data such that the base peak has an intensity of 1000 units, the noise was found to be random, independent of mass position or intensity, and ranging in magnitude between -10 and +10 units. Thus the base peak has an error of ±1% while a mass peak with an intensity of 200 units has a ±5% error. The noise is of the same magnitude as that used by Benz in a recent study using simulated mass spectral data (13). Although the amount of noise will vary in different systems, the values used in this study closely approximate the average noise encountered in a well-tuned GC-MS instrument. In this work, a normally distributed random number generator with a mean of 0 and standard deviation of 5 was used to introduce noise to the data. The random noise was added to each mass intensity in each scan.

A simulated GC peak containing the major and minor components and in some cases random noise was constructed by using simple computer programs for the multiplication and addition of matrices. A spectral data matrix from the partially resolved chromatographic peak was formed with the rows corresponding to m/e values and the columns to scan number. The array contained the ion intensity of each m/e value at each scan. Thus, one column of data was equivalent to one complete mass spectrum, while each row of data was essentially a reconstructed ion chromatogram of one m/e value as a function of scan number. In series A, the eight most intense masses of each component were chosen to form the data matrix, while in series B the six most intense masses were used.

All of the factor analysis work was done by using a covariance rather than a correlation matrix. Based on the definition of a covariance matrix, each data point is weighted in proportion to its absolute value. Large data points are given more statistical importance than are smaller values. The use of the covariance matrix is recommended when all of the data points have approximately the same absolute error, as is true in mass spectroscopic data. A correlation matrix is used if one wishes to give equal weight to each column of data, which is accomplished by the normalization of each column of data. The use of the correlation matrix is recommended when the experimental error is directly proportional to the magnitude of the measurement. Employing a correlation matrix for GC-MS data would place the same importance on scans at the edge of a peak, where the total number of ions is very low, as it would place on the scans in the center of the peak and could cause misinterpretation of the results.

All programs were written in FORTRAN and run on an IBM 370/168 located at Merck & Co., Inc., in Rahway, NJ.
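The construction of the data matrix and the two weighting schemes can be sketched as follows. This is a hedged Python illustration, not the authors' FORTRAN code. It assumes that "covariance matrix" means the product of the transposed data matrix with itself, with no mean-centering, and that the correlation case corresponds to scaling each scan column to unit length first. The three-mass spectra, five-scan profiles, and all function names are made up for the example.

```python
import random

def mixture_matrix(spectra, profiles):
    """D[i][j] = sum over components of spectrum[i] * profile[j].

    Rows are m/e values and columns are scans; linear additivity of the
    spectra is assumed.
    """
    n_masses, n_scans = len(spectra[0]), len(profiles[0])
    return [[sum(s[i] * p[j] for s, p in zip(spectra, profiles))
             for j in range(n_scans)] for i in range(n_masses)]

def add_noise(matrix, sd, rng):
    """Add normally distributed noise (mean 0, given sd) to every element."""
    return [[x + rng.gauss(0.0, sd) for x in row] for row in matrix]

def covariance_about_origin(d):
    """Z = D^T D over the scan columns, with no centering or scaling (assumed)."""
    cols = list(zip(*d))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def normalize_columns(d):
    """Scale every scan column to unit length (the correlation-matrix case)."""
    scaled = []
    for col in zip(*d):
        norm = sum(x * x for x in col) ** 0.5
        scaled.append([x / norm for x in col])
    return [list(row) for row in zip(*scaled)]

# Hypothetical spectra (base peak 1000 units) and elution profiles.
spectra = [[1000.0, 400.0, 50.0], [1000.0, 150.0, 300.0]]
profiles = [[0.1, 0.6, 1.0, 0.6, 0.1], [0.0, 0.01, 0.05, 0.1, 0.05]]

clean = mixture_matrix(spectra, profiles)
noisy = add_noise(clean, sd=5.0, rng=random.Random(0))
z = covariance_about_origin(noisy)
unit = normalize_columns(clean)
```

After `normalize_columns`, the weak edge scans carry as much weight as the scans at the peak center, which is the behavior the text warns against for GC-MS data.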
RESULTS AND DISCUSSION

Factor analysis was performed on mass spectral data from partially resolved GC peaks and from totally unresolved GC data. A sample of the results for a partially resolved case is shown in Table I. The GC data of each component were 21 scans wide. The major component peak remained stationary with a peak maximum at scan 11; the minor impurity peak was shifted so that it covered scans 2-22 (maximum at scan 12) at one extreme and scans 17-37 (maximum at scan 27) at the other extreme. Three different relative concentrations ranging from a 10% to a 0.1% impurity were investigated for both series A and series B compounds.

Examination of the series A results for the 90:10 mixtures shows that factor analysis definitely succeeds at all resolutions in detecting two components. For example, with a resolution of 0.048, a sharp drop in the eigenvalues occurs between the second and third factors (6.1 × 10^3 dropping to 6.6), which indicates that the sample contains two components. The first eigenvalue accounts for 99.98% of the variance in the data, while the second eigenvalue accounts for 0.02%. As the resolution is increased the second eigenvalue becomes larger and contributes significantly more to the reproduction of the
Table II. Eigenvalue Results for Two-Component Mixtures (a)

impurity   resolution 0.000              resolution 0.188              resolution 0.313
level      A (b)          B (c)          A (b)          B (c)          A (b)          B (c)
10%        2.7 × 10^7     2.3 × 10^7     2.7 × 10^7     2.3 × 10^7     2.7 × 10^7     2.3 × 10^7
           3.2 × 10^[?]   3.2 × 10^3     5.8 × 10^4     5.9 × 10^3     9.2 × 10^4     9.4 × 10^3
           8.2            4.7            7.0            5.2            8.5            4.0
           5.9            3.3            4.8            1.5            4.9            2.5
1%         2.6 × 10^7     2.1 × 10^7     2.5 × 10^7     2.1 × 10^7     2.5 × 10^7     2.1 × 10^7
           3.4 × 10^[?]   34.8           6.2 × 10^[?]   65.2           9.7 × 10^[?]   1.0 × 10^2
           7.6            3.8            6.8            3.4            8.1            3.2
           4.7            2.2            4.5            2.2            4.4            2.7
0.1%       2.5 × 10^7     2.1 × 10^7     2.5 × 10^7     2.1 × 10^7     2.5 × 10^7     2.1 × 10^7
           10.3           3.7            7.9            2.9            10.7           5.5
           5.2            1.5            6.2            2.0            7.1            3.0
           3.7            0.4            4.4            1.1            4.2            0.7

(a) The impurity GC peak is narrower than the main component peak and is totally submerged under the main component peak. (b) Matrix size 13 × 21. (c) Matrix size 6 × 21.

Figure 1. Reconstructed ion chromatograms (x axis: scan number, 0-30): (1) one pure compound from series A; (2) the same compound from series A with the addition of a 4% impurity at a resolution of 0.048; (3) the same as curve 2, except that the resolution between the components is 0.809.
data matrix using two factors. Factor analysis also succeeds in predicting two components at the 1% impurity level. The most difficult case, a resolution of 0.048, has eigenvalues of 2.6 × 10^7, 69.9, 8.1, and 3.7 or eigenvalue ratios (EV1/EV2, EV2/EV3, etc.) of 3.7 × 10^5, 8.6, and 2.2. Since an eigenvalue drop of a factor of 8.6 is probably significant, the data indicate two components. At greater resolutions a two-component mixture with a 1% impurity can easily be identified.

To gain a fuller appreciation of the effects such low levels of an impurity have on reconstructed ion chromatograms, see Figure 1. The difference between curve 1, a pure series A compound, and curve 2, the same compound with the addition of a 4% impurity and a resolution of 0.048, is virtually impossible to discern visually. Factor analysis, however, easily predicts two components; the eigenvalue drop between factors two and three is about 100. Curve 3 shows the same level of impurity with a resolution of 0.81.

Since the data summarized in Table I contain no noise, factor analysis might be expected to perform perfectly at the 0.1% impurity level as well. However, the results indicate imperfect performance, with the eigenvalues showing no significant drop between factors two and three, thereby predicting a single component at all resolutions. The reason for the failure of factor analysis at this low impurity level is that the contribution of the impurity is indistinguishable from the contribution due to computer roundoff error. Thus, for an ideal experiment in which no background noise is present and the only error is due to computer roundoff, factor analysis can correctly predict the presence of a 1% or greater impurity, even if the two components are nearly completely overlapped.

Similar results were obtained for series B. For all levels of resolution tested, factor analysis correctly predicts that the
90:10 mixture contains two components. A 1% impurity is successfully perceived for all except the smallest resolution, 0.048, where only one component is predicted. As occurred with series A compounds, a 0.1% impurity cannot be found by factor analysis in the presence of computer roundoff error. So for two compounds having identical mass spectral ions but with different relative intensities, factor analysis can resolve mixtures of the two at impurity levels as low as 1% in the absence of noise.

A more difficult and perhaps more interesting problem is presented when the impurity peak is completely buried in the major component peak. Simulation of this problem was achieved by reducing the width of the impurity peak to 11 scans. The narrower impurity peak was shifted from the left portion of the major peak (impurity peak maximum at scan 6) to directly under the major peak (maximum at scan 11). A portion of the results is presented in Table II. The size of the matrix for series A was 13 by 21 and for series B was 6 by 21. Interpretation of the results, especially for series B, was more difficult than the interpretation when the impurity was partially separated from the main component (Table I). Factor analysis still works well to detect two components in a 90:10 mixture for both series A and B and at the 1% impurity level in series A. For series B mixtures at the 1% impurity level, the eigenvalue drop between factors two and three varies from 10 to 30, making the interpretation more difficult. From experience a decrease in the eigenvalues on the order of magnitude of 10 is generally significant. On the basis of previous experience then, one would probably suggest a two-component mixture for all the series B samples at the 1% level. At the 0.1% level the drop in eigenvalues between the second and third factors is less than 3, clearly indicating a one-component sample.
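The geometry of the submerged-impurity simulation can be checked with a short calculation. The sketch below assumes resolution is the difference between peak maxima divided by the average peak width (here (21 + 11)/2 = 16 scans) and treats a peak as occupying its 4σ window. The paper states only the end positions of the impurity maximum (scans 6 and 11); the intermediate position at scan 8 is inferred here from the tabulated resolution of 0.188, and the function names are hypothetical.

```python
def resolution(max_a, max_b, width_a, width_b):
    """Difference between peak maxima divided by the average peak width."""
    return abs(max_b - max_a) / ((width_a + width_b) / 2.0)

def is_submerged(max_main, width_main, max_imp, width_imp):
    """True when the impurity's 4-sigma window lies inside the main peak's."""
    return (max_imp - width_imp / 2.0 >= max_main - width_main / 2.0 and
            max_imp + width_imp / 2.0 <= max_main + width_main / 2.0)

# Main peak: maximum at scan 11, 21 scans wide. Impurity: 11 scans wide.
cases = {}
for imp_max in (11, 8, 6):
    cases[imp_max] = (resolution(11, imp_max, 21, 11),
                      is_submerged(11, 21, imp_max, 11))
    print(imp_max, cases[imp_max])
```

Under these assumptions the three positions give resolutions of 0, 0.1875, and 0.3125, consistent with the 0.000, 0.188, and 0.313 column headings of Table II, and the impurity window stays inside the main peak in every case.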
Only the eigenvalues were evaluated in this work as the criterion to determine the number of components in a mixture. The error theory functions cannot be used, since there should be no error in the data matrix, although some error was introduced by computer roundoff. The RE function, which should have been zero at the correct number of components, was always approximately 0.2-0.4. Since the nature of the computer roundoff error is not uniform, the error theory cannot be used.

The difficulties encountered in evaluating some of the data in Tables I and II highlight the need to reduce the subjectiveness involved in determining what drops in the eigenvalues are significant. For development of such a criterion, all of the results from this investigation were tabulated and carefully studied. The magnitude of the eigenvalue drop (EV1/EV2, etc.) from one, two, three, or four eigenvectors spanning the
data set was calculated, and an optimum was determined empirically. The best cutoff value was found to be 7.0. This criterion means that if eigenvalue 2 is 7 times larger than eigenvalue 3, it is considered significant. By use of this criterion, the number of components is predicted by finding the last occurrence of a drop between factors that is greater than or equal to a factor of 7.0. For example, if the eigenvalues for an analysis were 20, 2.1, 1.1, 1 × 10^-4, and 9.5 × 10^-5, the eigenvalues predict three components. Even though the drop between factors 2 and 3 is less than the cutoff, the drop from 1.1 to 1 × 10^-4 is significant, so the last occurrence of a drop larger than the cutoff indicates three significant factors.

Table III. Success Rate for Determining the Correct Number of Components by Factor Analysis

             % correct predictions (a)
             EV (b)         IE (c)         IND (d)
concn        A      B       A      B       A      B
Partially Separated Impurity Peak, No Noise
90:10        100    100
99:1         100    90
99.9:0.1     0      0
Partially Separated Impurity Peak, Noise Added
50:50        100    100     100    100     100    100
90:10        90     90      100    90      100    90
92:8         82             87             83
94:6         69             80             73
95:5         89     0       90     47      92     7
96:4         85             89             90
97:3         79             80             90
98:2         45     0       65     0       81     0
99:1         0      0       0      0       10     0
Totally Submerged Impurity Peak, No Noise
90:10        100    100
96:4         100
99:1         100    100
99.9:0.1     0      0
Totally Submerged Impurity Peak, Noise Added
50:50        100    100     100    100     100    100
90:10        100    0       100    75      100    75
96:4         0      0       100    0       100    0
99:1         0      0       0      0       0      0

(a) Data are based on 10-100 predictions per case. (b) EV = eigenvalue. (c) IE = imbedded error. (d) IND = indicator.

Since noise and background are an inherent part of mass spectroscopic data, it was important to determine the level of detection of factor analysis in multicomponent systems containing noise. Normally distributed random noise was added as described in the Experimental Section to the mass spectral data of partially separated and completely submerged two-component gas chromatographic peaks. Factor analysis was performed on partially separated gas chromatographic peaks with impurity levels of 50, 10, 8, 6, 5, 4, 3, 2, and 1%.
These results were evaluated on the basis of the eigenvalue criterion and on the error theory IE and IND functions and are summarized in Table III. The partially separated impurity peak results are perfect for the 50:50 mixtures and show only one failure for the 90:10 mixtures, namely, the case where the main component and impurity peak maxima are separated by only one scan (resolution = 0.048). Even at this minimum resolution level, the IE and IND functions reach a minimum at two factors for series A, resulting in correct predictions. For series A, the introduction of noise to the spectroscopic data has increased the level of detection to 2-4% depending
on the resolution and the exact amount of noise added. Without noise the limit of detection is approximately 1%. The limit of detection of the B series has increased from 1-2% to 54%.

A noise study was also done on gas chromatographic peaks in which the impurity was completely submerged under the main component. Impurity levels of 50, 10, 4, and 1% were studied, and a summary of the results is given in Table III. With the introduction of noise the limit of detection has increased from 1% to approximately 4-5% for series A and approximately 10% for series B.

Results of the simulation study demonstrate that the limit of detection is established by the noise in the system, not by the factor analysis method. When factor analysis fails, it predicts too few components because the low-level impurities are lost in the background noise. In this work the average level of detection was approximately 6%, which was also roughly the noise level. Therefore, if one wants to find a 2% impurity in the sample, factor analysis will possibly find it, if the signal from the impurity can be made greater than the general background noise by injection of a larger sample or by changing other experimental conditions. The ability of factor analysis to resolve low-level impurities above the noise level gives the experimentalist an approach for obtaining more information from the mass spectral data.
CONCLUSIONS

The results of this simulation study demonstrate that factor analysis is a viable technique for resolving mass spectroscopic data containing low-level impurities. In the absence of noise, a 1% impurity can be seen, even if it is completely submerged under the main peak. When noise is added, the limit of detection increases to 2-4% for partially separated peaks and to 4-10% for a completely buried impurity peak. Predictions based on the IE and IND functions were superior to eigenvalue predictions for these data, which included idealized uniform noise. Since the real world often contains less uniformly distributed noise, the IE and IND functions perform somewhat more poorly (10), and for that reason a cutoff of 7 for the drop between successive eigenvalues is proposed to determine the correct number of significant factors.
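The proposed eigenvalue criterion is simple enough to state as code. The following Python sketch applies the rule as described in this paper: compute the ratios of successive eigenvalues and take the position of the last ratio that is greater than or equal to 7.0. The function names are hypothetical, positive eigenvalues sorted largest first are assumed, and returning 0 when no ratio reaches the cutoff is a choice made for this sketch, since the paper does not discuss that case.

```python
def eigenvalue_ratios(eigenvalues):
    """Ratios EV1/EV2, EV2/EV3, ... for eigenvalues sorted largest first."""
    return [a / b for a, b in zip(eigenvalues, eigenvalues[1:])]

def count_components(eigenvalues, cutoff=7.0):
    """Number of factors before the last drop of at least `cutoff`.

    Returns 0 if no drop reaches the cutoff (a convention of this sketch).
    """
    n_components = 0
    for position, ratio in enumerate(eigenvalue_ratios(eigenvalues), start=1):
        if ratio >= cutoff:
            n_components = position   # keep the last qualifying drop
    return n_components

# Worked example from the text: the significant drop is from 1.1 to 1e-4.
example = count_components([20, 2.1, 1.1, 1e-4, 9.5e-5])

# Series A, resolution 0.048 (Table I): 1% and 0.1% impurity, no noise.
one_percent = count_components([2.6e7, 69.9, 8.1, 3.7])
tenth_percent = count_components([2.5e7, 8.9, 6.4, 3.7])
print(example, one_percent, tenth_percent)
```

For the series A values at a resolution of 0.048, the ratios are about 3.7 × 10^5, 8.6, and 2.2 at the 1% level, so the rule reports two components, while at the 0.1% level no drop after the first reaches 7 and a single component is reported.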
ACKNOWLEDGMENT

The authors thank R. F. Hirsch and E. R. Malinowski for helpful discussions and ideas.
LITERATURE CITED

(1) Rummel, R. J. "Applied Factor Analysis"; Northwestern University Press: Evanston, IL, 1970.
(2) Selzer, R. B.; Howery, D. G. J. Chromatogr. 1975, 115, 139-151.
(3) Cline Love, L. J.; Cala Tway, P.; Upton, L. M. Anal. Chem. 1980, 52, 311-314.
(4) Weiner, M. L.; Weiner, P. H. J. Med. Chem. 1973, 16, 655-661.
(5) Rozett, R. W.; Petersen, E. M. Anal. Chem. 1975, 47, 1301-1308.
(6) Halket, J. McK.; Reed, R. I. Org. Mass Spectrom. 1976, 10, 808-812.
(7) Ritter, G. L.; Lowry, S. R.; Isenhour, T. L.; Wilkins, C. L. Anal. Chem. 1976, 48, 591-595.
(8) Halket, J. McK. J. Chromatogr. 1979, 175, 229-241.
(9) Halket, J. McK. "Recent Developments in Chromatography and Electrophoresis"; Frigerio, A., Renoz, L., Eds.; Elsevier: Amsterdam, 1979; pp 327-340.
(10) Tway, P. C.; Cline Love, L. J.; Woodruff, H. B. Anal. Chim. Acta 1980, 117, 45-52.
(11) Howery, D. G. "Statistics"; Hirsch, R. F., Ed.; Franklin Institute Press: Philadelphia, PA, 1978; Chapter 7.
(12) Malinowski, E. R. Anal. Chem. 1977, 49, 606-617.
(13) Benz, W. Anal. Chem. 1980, 52, 248-252.
RECEIVED for review April 14, 1980. Accepted July 18, 1980. P.C.T. thanks Merck & Co., Inc., for partial financial support.