Factor Analysis of the Mass Spectra of Mixtures
Garry L. Ritter, Stephen R. Lowry, and Thomas L. Isenhour*
Department of Chemistry, University of North Carolina, Chapel Hill, N.C. 27514
Charles L. Wilkins
Department of Chemistry, University of Nebraska, Lincoln, Neb. 68508
A detailed method for the determination of the number of components in a series of related mixtures is described. Using principal component analysis of the mass spectra of a series of such mixtures, it is possible to determine the number of components. It is shown, with several examples, that the algorithm described has potential as a highly useful tool for the interpretation of combined chromatography-mass spectrometric data.
Principal component analysis, which is one type of factor analysis (FA), is a well-known statistical technique for the interpretation of data. Although its earliest and most extensive applications were in the psychological and social sciences (1-4), it has more recently been applied to a wide variety of chemical problems, as noted by Rozett and Petersen (5). Of particular relevance to the present work are the early examples of the use of FA for analysis of optical spectra (6-10) and the more recent studies of Rogers and co-workers in the use of the method for the interpretation of gas chromatographic data (11, 12). As both this and the optical spectra work showed, principal component analysis is a feasible approach to determining the number of components in unknown mixtures. One of the most promising possibilities for utilizing FA is its application to combined gas chromatographic-mass spectrometric data, as first suggested by Davis et al. (12). More generally, it should be equally applicable to any type of combined partition chromatography-mass spectrometric data. In the work reported here, a general and quantitative method for reliably determining the number of components in unknown mixtures has been developed and tested.
MASS SPECTRA OF MIXTURES
To determine the unknown number of components in a mixture, the mass spectra of various compositions of the components are measured and analyzed. These mixtures of different compositions may be obtained, for example, by sampling an unresolved gas chromatographic peak. The analysis is possible because the mass spectrum of a mixture is a linear combination of the mass spectra of the pure components. As an example, consider the simplest case, a two-component mixture of substances arbitrarily designated C and D. The mass spectrum of C or D may be represented by a vector with a number of dimensions equal to the number of m/e positions sampled. The magnitude for a given dimension will be the intensity of the signal at the given mass position. The spectrum of C could be designated A(C) and the spectrum of D, A(D). If, in a mixture, f is the mole fraction of C, then (1 - f) is the mole fraction of D. The mass spectrum of the mixture is denoted by A(fC + [1 - f]D). Since the mass spectra are linear, the spectrum of the mixture is the weighted sum of A(C) and A(D), as shown in Equation 1.

$$A(fC + [1 - f]D) = fA(C) + [1 - f]A(D) \tag{1}$$
To be more specific, suppose that three corresponding mass positions are selected for the spectra of the pure materials. For component C, the intensity at the first mass position is 10, at the second mass position 20, and at the third 15. For component D, the intensities at these mass positions are 20, 0, and 5. This yields the mass spectrum vectors

$$A(C) = (10, 20, 15) \tag{2}$$

$$A(D) = (20, 0, 5) \tag{3}$$

Then a mixture of 40 mol % C and 60 mol % D gives

$$A(0.4C + 0.6D) = 0.4(10, 20, 15) + 0.6(20, 0, 5) = (16, 8, 9) \tag{4}$$
The mass spectrum of the mixture is given by the intensities (16, 8, 9). Therefore, given the mass spectrum of each pure component, the spectrum of any mixture of the components may be calculated. However, in this application, the mass spectra of a series of mixtures are given and the problem is to discover how many pure components are necessary to reconstruct the mass spectra of the mixtures. A mass spectral system matrix is constructed which contains one row per mass position and one column per spectrum (or different mixture). The rank of this matrix determines the number of independent factors which must be used to produce the system matrix. This is the number of components in the mixture (7).

In a real system, a system which contains random error, the problem becomes only slightly more difficult. In this case, the number of independent factors which reproduce the data within the allowed error must be found. Conditions which may lead to an incorrect number of components are: 1) some component is not represented in any of the mass positions investigated; or 2) one component's spectrum is a linear combination of the others. The first condition can easily be avoided by careful attention to the number and identity of the mass positions used. In fact, the existence of this condition can be used to advantage to remove background (vide infra). The second condition is unlikely, and the chances of it occurring can be minimized by avoiding the use of small numbers of intensities in the analysis. Implicit in the discussion thus far has been the fact that it is possible to choose how many and which m/e positions in the spectra one uses. In other words, the entire mass spectrum need not be used. In the present work, using two- or three-component mixtures to test the method, 15 to 20 m/e positions proved satisfactory.
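The linear mixing rule of Equation 1 and the rank argument can be illustrated with a short sketch. It reuses the pure-component vectors of Equations 2 and 3; the four mole fractions and the helper names `mixture_spectrum` and `matrix_rank` are assumptions chosen for illustration, and the data are error-free, so an exact rank can be computed.

```python
from fractions import Fraction

# Pure-component spectra from Equations 2 and 3 (three mass positions).
A_C = [10, 20, 15]
A_D = [20, 0, 5]

def mixture_spectrum(f, spec_c, spec_d):
    """Spectrum of a mixture with mole fraction f of C (Equation 1)."""
    return [f * c + (1 - f) * d for c, d in zip(spec_c, spec_d)]

def matrix_rank(rows):
    """Rank by Gaussian elimination using exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Hypothetical mole fractions of C for four mixtures (illustrative only).
fractions_of_c = [Fraction(2, 5), Fraction(1, 5), Fraction(3, 5), Fraction(4, 5)]
spectra = [mixture_spectrum(f, A_C, A_D) for f in fractions_of_c]

# System matrix: one row per mass position, one column per spectrum.
system_matrix = [list(row) for row in zip(*spectra)]

print([float(x) for x in spectra[0]])  # [16.0, 8.0, 9.0], as in Equation 4
print(matrix_rank(system_matrix))      # 2, the number of components
```

With real, rounded intensities the exact rank would generally equal the smaller matrix dimension, which is why the text turns to reproducing the data within the allowed error instead.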
ANALYTICAL CHEMISTRY, VOL. 48, NO. 3, MARCH 1976

FACTOR ANALYSIS
As mentioned above, in order to begin the analysis procedure, the mass spectra of a series of unknown mixtures of the same components are first represented as a spectral system matrix containing d rows of mass positions and s columns of spectra. The values for d and s should be chosen so that sufficient mass positions are represented to avoid conditions 1 and 2 discussed previously and so that the number of spectra equals or exceeds the suspected number of components. The method used here is Q analysis, as explained by Rozett and Petersen (5). Each of the columns is normalized to have a zero sum, generating a new matrix, A, which also contains d rows and s columns (Equations 5-7).

$$\bar{a}_j = \frac{1}{d} \sum_{i=1}^{d} a'_{ij} \tag{5}$$

$$a_{ij} = a'_{ij} - \bar{a}_j \tag{6}$$

$$A = [a_{ij}] \tag{7}$$

Here $a'_{ij}$ is the intensity of mass position i for spectrum j and $a_{ij}$ is the corresponding normalized intensity. The covariance matrix, C, is formed by premultiplying A by its transpose and is a symmetric matrix containing s rows and s columns.

$$C = A^{T}A \tag{8}$$

The problem is now to find the number of orthogonal factors necessary to cause this covariance. The number of such factors is easily determined by diagonalizing the covariance matrix and obtaining the eigenvalues and eigenvectors. These vectors are chosen to satisfy the relation

$$CX = \lambda X \tag{9}$$

where $\lambda$ is a positive number. Then

$$X^{T}CX = X^{T}\lambda X = \lambda X^{T}X = \lambda \tag{10}$$

The vectors are then ordered such that $\lambda_1 \ge \lambda_2 \ge \ldots \ge 0$. Finally, it is necessary to determine how many of the eigenvalues ($\lambda_i$) and corresponding eigenvectors ($X_i$) are required to sufficiently describe the system and, hence, how many components the system contains. In previous work, specifically that done by Hugus and El-Awady (9), methods have been suggested for determining the number of components. The ultimate goal is to reproduce the original data within the experimental uncertainty. However, when these methods are used, the uncertainty is expressed as a constant relative error. In the present work, the mass spectra were encoded by hand from the mass spectral graphs. The dominant error was assumed to result from rounding the intensity data to the nearest 0.1 unit. Given this assumption, if the "true intensity value" were to lie between 2.05 and 2.15 units, the reported value would be rounded to 2.1. Therefore, the uncertainty is absolute and some alternate criterion must be found. This discussion uses the number of eigenvectors necessary to reproduce the covariance matrix within error.

Consider the intensity values which are reported to be 2.1 units. By assumption, the true intensity lies between 2.05 and 2.15 units. Similarly, it is equally likely that any of these values might appear. This describes a probability distribution for the intensities reported as 2.1. The distribution is a square wave which ranges between 2.05 and 2.15. Since all true values between 2.05 and 2.15 units are reported as 2.1, the area or total probability beneath the square wave is one. Using this fact, the amplitude of the wave may be shown to be ten (10). The mean value of the distribution is 2.1, the reported intensity, and the variance ($\sigma^2$) is calculated from Equation 11.

$$\sigma^{2}_{2.1} = \int_{2.05}^{2.15} 10\,(x - 2.1)^{2}\,dx = \frac{1}{1200} \tag{11}$$

The variance, which is independent of the intensity value recorded, was used to estimate the standard deviation in the intensity value. When an intensity is measured, the standard deviation from Equation 11 is used to estimate its uncertainty. Each term in the covariance matrix may be written from the normalized intensities (Equation 12).

$$c_{ij} = \sum_{k=1}^{d} a_{ki}a_{kj} \tag{12}$$

The variance about the mean $c_{ij}$ is calculated from the variance of the means of $a_{ki}$ and $a_{kj}$ estimated from Equation 11. The standard deviation in the covariance element $c_{ij}$ is given by Equation 14. This information for each $c_{ij}$ term may be placed into an error matrix $e(C)$ (Equation 15) and employed in an iterative procedure to find the number of components. This analysis is described in Equations 16-19. The covariance matrix may be reconstructed from the eigenvalues ($\lambda_i$) and the eigenvectors ($X_i$) as shown in Equation 16.

$$C = \sum_{i=1}^{s} \lambda_i X_i X_i^{T} \tag{16}$$

C may be approximated by h of the eigenvectors, and this approximation is denoted by $C_h$.

$$C_h = \sum_{i=1}^{h} \lambda_i X_i X_i^{T}, \quad h = 1, 2, \ldots, s \tag{17}$$

In this notation $C_s = C$. Using this formalism, the individual covariance element ($c^{h}_{ij}$) is calculated from the relation

$$c^{h}_{ij} = \sum_{k=1}^{h} \lambda_k x_{ki} x_{kj} \tag{18}$$

where $x_{AB}$ is the Bth element of the Ath row vector. That is, multiply corresponding elements of column i and column j of the eigenvector matrix by the corresponding eigenvalues and sum over all eigenvalues included. The number of residuals is the number of $c^{h}_{ij}$ for which the difference from $c_{ij}$ exceeds the allowed error in $e(C)$ (Equation 19). The number of components is the smallest value of h for which all covariance terms lie within the allowed error, or for which there are zero residuals.
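The iterative procedure of Equations 16-19 can be sketched in a few lines. The numbers below are read from the worked cyclohexane/cyclohexene example reported later in the paper (covariance matrix of Equation 20, eigenvalues and eigenvectors of Table II, and the error matrix). Two points are assumptions of this sketch rather than statements from the text: the residual test is taken to be a plain absolute-difference comparison against the error matrix, and the function names are illustrative.

```python
# Covariance matrix of the worked example (Equation 20).
C = [
    [264.7, 279.4, 218.7, 73.2],
    [279.4, 348.7, 333.1, 190.8],
    [218.7, 333.1, 379.6, 281.5],
    [73.2, 190.8, 281.5, 266.5],
]
# Eigenvalues and eigenvectors as read from Table II (one eigenvector per row).
eigenvalues = [1035.8, 222.7, 0.7, 0.2]
eigenvectors = [
    [0.411, 0.566, 0.594, 0.397],
    [-0.633, -0.275, 0.245, 0.681],
    [0.579, -0.768, 0.235, 0.143],
    [-0.309, -0.122, 0.729, -0.599],
]
# Error matrix e(C) as read from the worked example.
E = [
    [0.7, 0.7, 0.7, 0.7],
    [0.7, 0.8, 0.8, 0.7],
    [0.7, 0.8, 0.8, 0.7],
    [0.7, 0.7, 0.7, 0.7],
]

def approximate(h):
    """C_h built from the first h eigenpairs (Equations 17 and 18)."""
    s = len(C)
    return [
        [sum(eigenvalues[k] * eigenvectors[k][i] * eigenvectors[k][j]
             for k in range(h))
         for j in range(s)]
        for i in range(s)
    ]

def count_residuals(h):
    """Count upper-triangle elements of C not reproduced within e(C).

    Assumption: an element is a residual when |c_ij - c_ij^h| > e_ij.
    """
    c_h = approximate(h)
    s = len(C)
    return sum(
        1
        for i in range(s)
        for j in range(i, s)
        if abs(C[i][j] - c_h[i][j]) > E[i][j]
    )

def number_of_components():
    """Smallest h giving zero residuals, or None if no h qualifies."""
    for h in range(1, len(C) + 1):
        if count_residuals(h) == 0:
            return h
    return None

print(count_residuals(1))      # 10
print(count_residuals(2))      # 0
print(number_of_components())  # 2
```

Only the upper triangle is scanned because the covariance matrix is symmetric, which matches the ten residuals reported for the first approximation of a four-spectrum system.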
EXPERIMENTAL
Mixtures were prepared from high-purity samples of a number of materials and were purposely chosen from representative compounds with mass spectra which are very similar, in order to test the method on the most demanding types of mixtures. Mass spectra were obtained on either a Hitachi RMU-6D or RMU-7 mass spectrometer, and the resulting spectra were digitized manually. All computations were carried out using an IBM 360/75 computer.
Table I. Digitized Intensity Values: Cyclohexane/Cyclohexene Mixtures
% Cyclohexane
m/e: 27 28 29 39 40 41 42 43 51 53 54 55 56 67 68 69 79 81 82 84
20%
40%
60%
80%
2.3 1.2
3.2 1.3
3.4
1.1
1.1
3.9 .7 8.6 3.5 1.6 .5 .9 3.6 5.1 14.2 4.4 .5 4.1 .4 .5 1.5 10.5
5.8 1.o 10.5 3.7 1.7
2.1 .7 .5 4.9 .6 6.0
1.3 1.1
6.8
1.o 10.5
3.1 1.4 1.5 1.7 11.3 4.5 11.4 16.0
1.1
1.7 8.0
5.4 14.4 11.1 .8
1.1
.4
1.o
1.7 10.1
1.1 3.3
4.3 .8
1.o
1.2 1.6 6.1 8.5
1.0
4.1 11.8
b
1.9 3.6 15.1 .8 1.o 1.4 5.4 2.4
Table II. Eigenvalues and Eigenvectors for Cyclohexane/Cyclohexene Mixtures
h    Eigenvalue    Eigenvector
1    1035.8         0.411    0.566    0.594    0.397
2     222.7        -0.633   -0.275    0.245    0.681
3       0.7         0.579   -0.768    0.235    0.143
4       0.2        -0.309   -0.122    0.729   -0.599
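A quick consistency check on the tabulated eigenpairs is sketched below, assuming each eigenvector is read as one row of Table II and using the covariance matrix of the worked example (Equation 20). The eigenvalues of a symmetric matrix should sum to its trace, and the eigenvectors should be orthonormal, as Equation 10 requires; small deviations are expected because the published values are rounded. The variable names are illustrative.

```python
# Diagonal of the covariance matrix from the worked example (Equation 20).
C_diagonal = [264.7, 348.7, 379.6, 266.5]

# Eigenvalues and eigenvectors as read from Table II (one eigenvector per row).
eigenvalues = [1035.8, 222.7, 0.7, 0.2]
eigenvectors = [
    [0.411, 0.566, 0.594, 0.397],
    [-0.633, -0.275, 0.245, 0.681],
    [0.579, -0.768, 0.235, 0.143],
    [-0.309, -0.122, 0.729, -0.599],
]

def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# The trace of C should equal the sum of its eigenvalues.
trace = sum(C_diagonal)
eigen_sum = sum(eigenvalues)

# Largest deviation of X_a . X_b from the identity matrix entries.
n = len(eigenvectors)
max_deviation = max(
    abs(dot(eigenvectors[a], eigenvectors[b]) - (1.0 if a == b else 0.0))
    for a in range(n)
    for b in range(n)
)

print(round(trace, 1), round(eigen_sum, 1))  # 1259.5 1259.4
print(max_deviation < 0.002)                 # True
```

The agreement to within rounding supports reading the four eigenvector components across each row of the table.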
Table III. Digitized Intensity Values: Cyclohexane/Hexane Mixtures
% Cyclohexane
m/e: 27 28 29 39 40 41 42 43 44 54 55 56 57 69 83 84 85 86
100%
90%
1.8
1.6
2.1 1.3 2.5 0.7 7.1 3.5 2.2 0.2
1.8
0.8
4.6 13.(i
1.2 3.8 0.8
10.7 0.9 0.1
1.5 2.0 0.5 6.4 3.4 2.5 0.2 0.6 3.7 11.6 2.3 3.4 0.6 8.2 0.8 0.4
80%
50%
20%
10%
0%
2.4 4.1 2.0 2.2 0.6
2.6
2.8 6.8 4.4 2.0 0.3 9.1 4.7 7.4 0.3 0.3 2.4 9.9 9.6 1.6 0.4 3.6 0.5 2.4
2.7 2.8 4.2 2.0 0.4
2.8 2.0 5.1 1.6 0.4
8.0
3.7 3.3 0.2 0.6
4.3 12.6 3.2 3.4 0.6 8.6 0.8 0.6
1.8
3.3 2.5 0.5 9.3 4.6 6.2 0.2 0.5 3.5 12.5 7.3 2.7 0.5 7.3 0.6 1.6
8.8
8.8
4.6 7.9 0.2 0.2
4.7 8.6 0.3
1.1
0.9 6.8 12.2 0.2
8.4 11.0 1.0 0.4 2.0 0.3
2.5
d
0.1
0.1 0.1 0.1
2.8
Two of the sets of mixtures used in the experiments were prepared and mass spectra determined by a chemist not involved in the analysis. The data thus obtained were submitted to us as undigitized mass spectra for factor analysis. Four sets of mixtures, known and unknown, were examined. These were: (1) cyclohexane/cyclohexene; (2) hexane/cyclohexane; (3) heptane/octane; (4) unknown xylenes. Sets 2 and 3 were prepared by us, the others elsewhere, as described above.

Figure 1. Partial mass spectra of the mixtures used in the example analysis: (a) 80% by volume cyclohexane, (b) 60% cyclohexane, (c) 40% cyclohexane, (d) 20% cyclohexane

RESULTS AND DISCUSSION
It is useful to consider one example in detail to fully clarify the procedure. Four mixtures of cyclohexane and cyclohexene were the source of mass spectra. These mixtures contained, respectively, 80%, 60%, 40%, and 20% by volume cyclohexane. Figure 1 shows partial mass spectra of these mixtures, including only those peaks which were used in the analysis. For convenience, the digitized intensity values are listed in Table I. Twenty m/e positions were used; normalization of the resulting mass spectral system matrix (Equations 5-7), followed by premultiplication of the normalized matrix by its transpose (Equation 8), gave the following covariance matrix:

$$C = \begin{bmatrix} 264.7 & 279.4 & 218.7 & 73.2 \\ 279.4 & 348.7 & 333.1 & 190.8 \\ 218.7 & 333.1 & 379.6 & 281.5 \\ 73.2 & 190.8 & 281.5 & 266.5 \end{bmatrix} \tag{20}$$

Solving this for the eigenvalues and eigenvectors (Equations 9-10), the data in Table II are obtained. When the error estimation calculation is carried out (Equation 14), the error matrix below results:

$$e(C) = \begin{bmatrix} 0.7 & 0.7 & 0.7 & 0.7 \\ 0.7 & 0.8 & 0.8 & 0.7 \\ 0.7 & 0.8 & 0.8 & 0.7 \\ 0.7 & 0.7 & 0.7 & 0.7 \end{bmatrix} \tag{21}$$

Next, the covariance approximations (Equations 16-19) are used. The first approximation (using only the first eigenvector) gives

$$C_1 = \begin{bmatrix} 175.2 & & & \\ 240.9 & 331.4 & & \\ 253.2 & 348.3 & 366.1 & \\ 169.1 & 232.6 & 244.4 & 163.2 \end{bmatrix} \tag{22}$$

Because the matrix is symmetric, only half of it need be considered. Applying the criterion of Equation 19, there are ten residuals and thus a second approximation is required. Using a second approximation (including both the first and second eigenvectors), $C_2$ is obtained:

$$C_2 = \begin{bmatrix} \ldots & & & \\ 279.7 & 348.3 & & \\ 218.7 & 333.3 & 379.4 & \\ 73.1 & 190.8 & 281.6 & 266.4 \end{bmatrix} \tag{23}$$

There are zero residuals (Equation 19); therefore, the mixture contains two components. The digitized data used for the other three sets of mixtures are contained in Tables III-V. Table VI contains the results of factor analysis of all these mass spectra. Several observations are worthy of note. First, in every case, the factor analysis method correctly determined the number of components in the mixture. Especially impressive were the two cases where initial analysis seemed to be in error (the hexane/cyclohexane and heptane/octane mixtures). For both of these, one component more than had been used in making the mixtures was calculated. A review of the experimental procedures suggested that nitrogen was a possible contaminant in the hexane/cyclohexane samples. If that were so, elimination of the m/e 28 intensity data from the analysis should have yielded the "correct" answer. When this was done, factor analysis did determine that two components were present. In the case of the heptane/octane mixture, it was found by examination of the spectra that there was clear evidence of source contamination by nitrobenzene derivatives run previously. Accordingly, the "correct" answer of two components was wrong, since the spectra actually were those of three-component mixtures.

Table IV. Digitized Intensity Values: Heptane/Octane Mixtures
% Heptane: 90%, 75%, 50%, 25%, 10%
m/e: 27 29 30 39 40 41 42 43 44 55 56 57 58 70 71 72
1.0
0.5 0.6 0.6 0.7
0.5
0.9 0.9 0.8
0.4 0.8 0.8
0.7 0.4 1.4 0.9
0.5 0.2
0.1
0.1
0.1
0.1
0.1
1.9 0.6 4.0 1.4 0.9
1.2
0.9
0.4
0.3
2.5
1.9 1.3 0.6 0.5
1.5 0.8 0.3 0.2
1.0
2.1 0.1 1.0
1.7 0.1
0.7
0.8
0.3 0.5
0.1
0.1
1.3
0.7 0.7 1.5
0.5
0.4
1.0 0.1
1.1 0.1 0.4
0.1
0.6 1.1 0.1
1.1
0.5 0.2 0.6 0.9 0.5 0.3 0.6 0.1 0.2 0.3 0.4

Table V. Digitized Intensity Values: Unknown Xylene Mixtures
m/e    Mixture A    B       C       D       E
27     1.8          1.6     1.7     2.3     2.3
39     5.8          4.8     4.6     5.5     5.4
50     2.7          1.9     1.9     2.3     2.4
51     5.6          4.0     4.1     5.2     5.3
52     2.8          1.7     1.5     1.8     1.9
63     3.3          2.8     2.7     2.8     2.7
65     4.6          3.0     3.3     3.5     3.5
77     6.1          4.7     5.1     6.2     6.4
78     3.8          2.5     2.5     3.2     3.1
79     3.3          2.5     3.0     3.6     3.6
91     41.4         29.7    33.2    41.1    40.8
92     3.7          2.5     3.0     3.6     3.7
103    2.1          1.8     2.2     2.6     2.7
105    1.4          7.1     6.0     10.8    11.6
106    18.5         14.7    5.4     6.6     6.8
107    1.5          1.7     1.6     2.0     2.1

Table VI. Factor Analysis of Mass Spectra of Mixtures
Mixture                    No. of     No. of m/e       No. of        Residuals (h)                FA       Comment
                           mixtures   positions used   components    1   2   3   4   5   6   7    number
Cyclohexane/Cyclohexene    4          20               2             10  0   0   0   -   -   -    2        ...
Hexane/Cyclohexane         7          17               2             28  0   0   0   0   0   0    2        ...
                                      18                             28  22  0   0   0   0   0    3        If m/e 28 included
Heptane/Octane             5          16               2             10  1   0   0   0   -   -    3        Source found contaminated with nitrobenzene
Xylenes                    5          16               3             15  9   0   0   0   -   -    3        Unknown

CONCLUSION
It has been shown that the factor analysis method described here can accurately determine the number of components in a series of mixtures. The method is computationally simple, rapid, and, as shown by Davis et al. (12), can easily be implemented on a laboratory minicomputer such as is commonly used with operational GC-mass spectrometry systems. Because of the ease of directly obtaining digitized mass spectra of gas chromatograph effluents of various compositions across a gas chromatography peak suspected of being a mixture, this method should be extremely useful in that application. Present results show the method to be both sensitive and reliable.
ACKNOWLEDGMENT
We thank Michael L. Gross of the University of Nebraska-Lincoln for providing the unknown xylene and cyclohexane/cyclohexene spectra.

LITERATURE CITED
(1) L. L. Thurstone, "Multiple Factor Analysis", University of Chicago Press, Chicago, Ill., 1947.
(2) H. H. Harman, "Modern Factor Analysis", University of Chicago Press, Chicago, Ill., 1967.
(3) P. Horst, "Factor Analysis of Data Matrices", Holt, Rinehart and Winston, New York, N.Y., 1965.
(4) R. J. Rummel, "Applied Factor Analysis", Northwestern University Press, Evanston, Ill., 1970.
(5) R. Rozett and E. M. Petersen, Anal. Chem., 47, 1301 (1975).
(6) R. M. Wallace, J. Phys. Chem., 64, 899 (1960).
(7) D. Katakis, Anal. Chem., 37, 876 (1965).
(8) J. J. Kankare, Anal. Chem., 42, 1322 (1970).
(9) Z. Z. Hugus, Jr., and A. A. El-Awady, J. Phys. Chem., 75, 2954 (1971).
(10) N. Ohta, Anal. Chem., 45, 553 (1973).
(11) D. Macnaughton, Jr., L. B. Rogers, and G. Wernimont, Anal. Chem., 44, 1421 (1972).
(12) J. E. Davis, A. Shepard, N. Stanford, and L. B. Rogers, Anal. Chem., 46, 821 (1974).

RECEIVED for review August 5, 1975. Accepted October 31, 1975. T. L. Isenhour is an Alfred P. Sloan Fellow, 1971-75. C. L. Wilkins is Visiting Associate Professor at the University of North Carolina, 1974-75 Academic Year. Support of this research through Grant MPS 75-04259 (CLW) and Grant GP-43720 (TLI) by the National Science Foundation is gratefully acknowledged.
Information Theory Applied to Selection of Peaks for Retrieval of Mass Spectra
Geert van Marlen
Department of Analytical Chemistry, Delft University of Technology, Jaffalaan 9, Delft, The Netherlands
Auke Dijkstra*
Department of Analytical Chemistry, State University of Utrecht, Croesestraat 77a, Utrecht, The Netherlands
By using Shannon's formula, amounts of information have been calculated for identification of binary coded low resolution mass spectra by retrieval. When a threshold of 1% of the base peak is used for the decision about the presence or absence of a peak, these binary coded mass spectra yield an amount of information of approximately 40 bits. It is found that, for a library of ca. 10 000 mass spectra, a set of 120 preselected mass values in the range 1-300 contains the total information: e.g., the nonselected masses do not supply any additional information.
The principles of information theory can be used to assess the amount of information obtained from the measurement of physical quantities. Using the amount of information and the correlation between these physical quantities, a set of characteristics can be selected which yields a maximum amount of information. A few years ago, Grotch (1) introduced the concept of information as defined by Shannon (2) in mass spectrometry. Grotch indicated that a mass spectrum yields an enormous amount of information, the exact amount depending on the number of peaks and the intensity levels that can be distinguished measuring these peaks. It was shown that, for binary coded spectra (peaks either absent or present), the number of bits obtained amounts roughly to 150 depending on the threshold level taken for the decision about the presence or absence of a peak. Erni (3) also calculated the information for a set of binary coded mass spectra. In a qualitative way, the correlations between the various masses were taken into account in a procedure for selecting the most suitable masses for retrieval purposes.

In this paper, the results of some calculations of the information obtained from a retrieval procedure with mass spectra are presented. Using the information as a criterion and taking into account the correlations between the peak occurrences, an optimal set of masses for retrieval has been selected. This study runs parallel to the calculation of information and the selection of gas chromatographic columns given in a recent paper by Dupuis and Dijkstra (4). The procedures used for calculating the amount of information and for selecting an optimal set of gas chromatographic columns can, in principle, be used for calculating the information obtained from retrieval of mass spectra. However, calculations are less straightforward due to the binary coding of the spectra. The efficiencies of mass spectra coded in several ways might be measured in terms of information obtained. As such, the amount of information might serve as an alternative to the matching histograms developed by Grotch (5). For a more detailed review of literature about retrieval of mass spectra, the reader is referred to (5).
AMOUNT OF INFORMATION
The amount of information from measuring a physical quantity (in this paper, the measurement of the intensity of a peak in a mass spectrum) is equal to the uncertainty with respect to the magnitude of this physical quantity before the experiment minus the uncertainty with respect to this magnitude remaining if the measurement is performed. Neglecting the uncertainty remaining after the experiment, so in the absence of experimental errors, the amount of information I according to Shannon (2) equals

$$I = -\sum_{i=1}^{n} p_i \operatorname{ld} p_i \tag{1}$$
where $p_i$ is the probability of measuring the intensity level i, and n is the number of intensity levels that can be distinguished. If only two intensity levels are distinguished, Equation 1 reduces to

$$I(1) = -p \operatorname{ld} p - (1 - p) \operatorname{ld} (1 - p) \tag{2}$$
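Equations 1 and 2 can be evaluated numerically with a small sketch. It assumes that "ld" denotes the base-2 logarithm, so that the result is in bits; the function names and the example probabilities are illustrative and not taken from the paper.

```python
import math

def information_bits(probabilities):
    """Shannon information I = -sum(p_i * ld p_i) in bits (Equation 1).

    Levels with zero probability contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def binary_information(p):
    """Two-level case (Equation 2): a peak is present with probability p."""
    return information_bits([p, 1 - p])

print(binary_information(0.5))            # 1.0
print(round(binary_information(0.1), 3))  # 0.469
```

A mass position at which a peak is present in half of all spectra carries the maximum of one bit, while a position whose peak is nearly always present or nearly always absent carries much less.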