Principal-component analysis applied to combined gas

Principal-Component Analysis Applied to Combined Gas Chromatographic-Mass Spectrometric Data

James E. Davis,1 Allan Shepard, Nancy Stanford,2 and L. B. Rogers

Department of Chemistry, Purdue University, West Lafayette, Ind. 47907

Principal-component analysis provides a relatively rapid means for determining if there are two or more components in a single chromatographic peak when using the equivalent of a multi-channel detector. For a two-component system, changes in the shape of the plot for the scalars of the two major vectors were examined as functions of the extent of the overlap of the distributions, the relative amounts of the two components, and the tailing of peaks. The presence of a second component was easy to detect, both in simulations that were noise-free and in many of those to which 10% noise had been added. Some of the same characteristics were found in overlapped chromatographic peaks for masses 44 and 45 for the isotopic carbon species of carbon dioxide and for a mixture of n-hexane and n-heptane for which six masses, that intentionally did not include those for the molecular ions, were used.

A single chromatographic peak may be made up of one or more components. If the peak shape is known a priori, one can use relatively simple means to decide if two or more species are present. For example, measurements of width-to-height ratios or of peak moments are diagnostic. On the other hand, simple criteria cannot be used with confidence when peak shapes are unknown (1). We have explored the use of principal-component analysis for this purpose.

Principal-component analysis (PCA), sometimes known as factor analysis or eigenvector analysis, is finding increased use in analytical chemistry as evidenced by the references in our earlier paper (2) and by other applications in chromatography (3-6). To apply PCA to the detection of a second species in a "single" peak, we have obtained mass spectra at fixed time intervals across a chromatographic peak and have considered each mass spectrum to be a vector with as many dimensions as there are mass-to-charge channels. The successive mass spectra represent a set of vectors that is first collected into a data matrix. Then, that matrix is transformed into another having the same amount of "information" as the original and also several new features. The new set of vectors, eigenvectors, are orthogonal so it is not possible to reproduce any one as a linear combination of the others. Furthermore, in our algorithm, the vectors appear in decreasing order of the corresponding eigenvalues. If the amount of the second vector (value) is small and, therefore, doubtful, we have found that plots of the scalars for the first and second vectors can provide a reliable visual basis for making a decision about the presence of a second component.

To understand these plots, it is important to know the following facts about scalars and eigenvectors and also the assumptions made in using them. First, the sum of the eigenvalues is the value of the trace, an invariant attribute of the original set of vectors. Then, assuming that the eigenvectors which correspond to suitably small eigenvalues can be discarded without losing significant information and that the number of eigenvectors remaining is equal to the number of physically distinct components (species), a linear combination of the remaining eigenvectors should reproduce adequately any of the original mass spectra. The weighting factors in that combination are called scalars, and a set of scalars uniquely describes each mass spectrum. The number of scalars in a set is equal to the number of retained eigenvectors, and that number is generally smaller than the number of mass channels found in the spectra.

In our study, we deliberately added a second component so as to explore any limitations of our procedures for finding it. Both simulated data and real data have been examined with respect to the percentage of the trace represented by the second vector and, also, with respect to the shape of the scalar plot. The simulations explored the effects of differences in the separation of peak maxima, in peak width, in relative skew, and in noise. We report the interpretation of scalars only for the purpose of detecting the presence of a second component.

1 Present address, Department of Pathology and Medicine, Washington University and Barnes Hospital, St. Louis, Mo. 63110.
2 Present address, Division of Hematology, Department of Medicine, Washington University, St. Louis, Mo. 63110.

(1) E. Grushka, Chem. Technol., 1971, 745.
(2) D. Macnaughtan, L. B. Rogers, and G. Wernimont, Anal. Chem., 44, 1421 (1972).
(3) N. Hartmann and S. J. Hawkes, "Gas Chromatography, 1970," A. Zlatkis, Ed., Chromatography Symposium, Univ. of Houston, Houston, Texas, 1970, p 84.
(4) S. C. Elliot, N. A. Hartmann, and S. J. Hawkes, Anal. Chem., 43, 1938 (1971).
(5) P. H. Weiner and J. P. Parcher, J. Chromatogr. Sci., 10, 612 (1972).
(6) S. Wold and K. Andersson, J. Chromatogr., 80, 43 (1973).
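The relationships stated in the introduction (eigenvalues summing to the trace, orthogonal eigenvectors, and scalars as weighting factors) can be illustrated with a minimal sketch. Several things here are assumptions rather than the authors' procedure: the matrix being decomposed is taken to be the uncentered product of the data matrix with its transpose, the eigenvectors are found by power iteration with deflation, the starting vector and tolerance are arbitrary, and all function names and the `amounts` values are hypothetical. The two five-channel spectra are the ones quoted for the simulations in the Experimental section.

```python
import math

def second_moment(data):
    """Return D^T D for a data matrix (rows = times, columns = mass channels)."""
    n = len(data[0])
    return [[sum(row[i] * row[j] for row in data) for j in range(n)] for i in range(n)]

def mat_vec(m, v):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dominant_eigenpair(m, tol=1e-12, max_iter=10000):
    """Power iteration for the largest eigenpair of a symmetric, non-negative-definite matrix."""
    n = len(m)
    # Assumed starting guess: the column holding the largest diagonal element.
    k = max(range(n), key=lambda i: m[i][i])
    v = [m[i][k] for i in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return 0.0, [1.0] + [0.0] * (n - 1)
    v = [x / norm for x in v]
    for _ in range(max_iter):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        w = [x / norm for x in w]
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, v)))
        v = w
        if change < tol:
            break
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))  # Rayleigh quotient
    return eigenvalue, v

def principal_components(data, n_keep):
    """Return the trace and the n_keep largest (eigenvalue, eigenvector) pairs."""
    m = second_moment(data)
    n = len(m)
    trace = sum(m[i][i] for i in range(n))
    pairs = []
    for _ in range(n_keep):
        lam, v = dominant_eigenpair(m)
        pairs.append((lam, v))
        # Deflation: remove the component just found before searching for the next.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return trace, pairs

def scalars(spectrum, vectors):
    """Weighting factors of one spectrum on each retained eigenvector."""
    return [sum(s * x for s, x in zip(spectrum, v)) for v in vectors]

spectrum_a = [0.0, 0.0, 0.5, 1.0, 0.5]
spectrum_b = [0.5, 1.0, 0.5, 0.0, 0.0]
amounts = [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5), (1.0, 0.0), (0.2, 0.1)]  # hypothetical
data = [[ca * a + cb * b for a, b in zip(spectrum_a, spectrum_b)] for ca, cb in amounts]

trace, pairs = principal_components(data, 2)
vectors = [v for _, v in pairs]
percent_of_trace = [100.0 * lam / trace for lam, _ in pairs]
```

Because every row of this toy matrix is an exact mixture of two spectra, two eigenvectors account for the whole trace and each row is reproduced by its two scalars; with real or noisy data a small remainder is left over, which is what the "Remaining, %" columns of the tables report.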
This investigation does not deal with the very difficult problem of finding the linear combination of eigenvectors which form the mass spectrum for each pure component. The latter problem is being investigated by Rosenthal (7).

Two different experimental examples have been examined. The first involved the partial separation of 13C16O2 and 12C16O2, first in a 1:1 mixture and then in their natural abundances. Carbon dioxide represents a trivial case in the sense that the signals for masses 44 and 45 were independent of one another (mathematically orthogonal). Hence, the conclusions from PCA could, if necessary, have been checked easily by other methods. The second, in which we deliberately chose conditions so as to produce overlapped peaks of n-hexane and n-heptane, represents a case where a unique channel for each substance was not available because peaks for the molecular ions were deliberately omitted. In that situation, one alternative to the use of PCA would have been to use the known mass spectra to set up simultaneous equations. However, PCA can,

(7) D. Rosenthal, Research Triangle Institute, Durham, N.C. 27709, personal communication, July 12, 1972.

ANALYTICAL CHEMISTRY, VOL. 46, NO. 7, JUNE 1974

Table I. Analyses of Simulated Data

A. Effect of Peak Separation for Peaks of Equal Heights and Widths

Sep., σ    Vector 1, %    Vector 2, %    Remaining, %            Figure
6.00       58.3           41.7           2 × 10^-7               1A
2.00       64.3           35.7           3 × 10^-7               1B
1.00       83.6           16.4           5 × 10^-7               1C
0.41       97.1           2.9            [illegible] × 10^-5
0.16       99.5           0.5            [illegible] × 10^-5     1D

B. Effect of Peak Separation in the Presence of 10% Noise

Sep., σ    Vector 1, %    Vector 2, %    Remaining, %            Figure
2.00       62.9           36.0           1.00
1.00       82.4           16.6           0.98
0.41       95.8           3.2            0.93                    3
0.16       98.2           0.8            0.91

C. Effect of Relative Height at 2 σ Separation

Rel. Ht    Vector 1, %    Vector 2, %            Remaining, %        Figure
1:1        64.3           35.7                   2 × 10^-7           4C
1:0.3      92.3           7.7                    6 × [illegible]
1:0.1      99.1           0.94                   1.3 × 10^-5         4B
1:0.03     99.9           0.86 × [illegible]     1.1 × 10^-5         4A

D. Effect of Relative Width at Zero Separation

Rel. Width    Vector 1, %    Vector 2, %    Remaining, %    Figure
1:1.15        99.8           0.2            -1 × 10^-5      5
1:1.15 a      98.6           0.6            0.9             5
1:1.08 a      98.7           none found     1.3

E. Effects of Tailing on Noise-Free Peaks of Equal Heights and Equal Widths (before Using α = 0.1 for the Exponential Smooth)

Sep., σ    Vector 1, %    Vector 2, %    Remaining, %    Figure
1.00 b     74.8           25.2           [illegible]
0.82 b     98.5           1.5            [illegible]     6A
0.41 b     99.1           0.9            [illegible]     6B
0          95.4           4.6            [illegible]
1.00 c     74.8           25.2           [illegible]
0.82 c     79.7           20.3           [illegible]
0.41 c     88.3           11.7           [illegible]
1.00 d     89.4           10.6           [illegible]
0.82 d     93.3           6.7            [illegible]
0.41 d     98.2           1.8            [illegible]

a 10% noise. b Only the first peak was tailed. c Only the second peak was tailed. d Equal tailing on both peaks.

Chromatographic columns were prepared as follows. For the carbon dioxide study, 100-125 mesh Porapak Q (Waters Associates) was packed into a 4-mm × 4.2-m column. A 4-mm × 1.0-m column, loaded with 3.5% by weight SE-30 on Teflon (DuPont), was used for separating the n-alkanes.

Apparatus. A Varian Aerograph 660 was connected by means of a Biemann separator to a UTI Model 100C quadrupole mass spectrometer. For preliminary measurements, the mass spectrometer was operated manually, and the output was fed either to an oscilloscope or a recording potentiometer. However, during data acquisition, selection of mass-to-charge ratio and of sensitivity of the amplifier (a ratio of 100:1 for 13C and 12C in their natural abundances) for the electron multiplier was done by computer. Mass selection was made using a 14-bit digital-to-analog converter (Analog Devices).

Procedures. Real and simulated data were obtained by means of a PDP-11/20 computer that employed 1-8 user BASIC modified for on-line real-time control and data acquisition (9). Likewise, the mathematical calculations for the eigenvector procedure (2) were written in BASIC and were verified by working out the example in Simonds' paper (10). Since we considered only binary mixtures, only two eigenvectors were retained. When an iteration produced a difference between successive estimates for an eigenvector of less than ±0.01% of its magnitude, calculation of that vector was stopped. The scalars were calculated and plotted (2) on suitable scales for both axes so as to "fill" the page of plotting paper.

In the simulations of mass spectrometry, the equivalent of a series of 5 mass-to-charge values was "measured" as a function of time. Hence, the columns in the data matrix correspond to the different masses and the rows to the times when they were measured. The mass spectrum for one component was assigned relative abundances for successive "masses" of 0.0, 0.0, 0.5, 1.0, and 0.5; for the second component, the values were 0.5, 1.0, 0.5, 0.0, and 0.0. A chromatographic peak of Gaussian shape and standard deviation, σ, was generated for each species. Each peak was 20-30 points wide (4 σ) out of a total of 200 points. Then, the corresponding mass spectrum was formed using the relative weights mentioned above for each mass channel. The final spectrum was a channel-by-channel sum of the values of the pair of contributing species.

Noise was added to the simulated data from a random-number generator which had a probability distribution of uniform amplitude. Actually, that noise distribution provided a much more difficult test than a Gaussian distribution and was not an unlikely one to be found in digitized data. In the absence of noise, only 2 or 3 iterations were required, whereas up to 8 were sometimes required, with 5 being the average, in the presence of 10% noise.

The effect of tailing was examined using a single exponential smoothing function:

$$S_t = (1 - \alpha) S_{t-1} + \alpha Y_t \qquad (1)$$
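The construction of the simulated data matrix described in the Procedures can be sketched as follows. This is only an illustration: the function names, the default σ of 6 points (so that 4 σ falls in the stated 20-30 point range), the placement of the peaks around the midpoint, and the reading of "10% noise" as a uniform deviate of at most 10% of the unit peak height are all assumptions, since the text does not spell them out.

```python
import math
import random

# Relative abundances for the five simulated "masses", as quoted in the text.
SPECTRUM_1 = [0.0, 0.0, 0.5, 1.0, 0.5]
SPECTRUM_2 = [0.5, 1.0, 0.5, 0.0, 0.0]

def gaussian_peak(n_points, center, sigma, height=1.0):
    """Chromatographic peak of Gaussian shape sampled at integer time points."""
    return [height * math.exp(-0.5 * ((t - center) / sigma) ** 2)
            for t in range(n_points)]

def simulate(n_points=200, sigma=6.0, separation=2.0, rel_height=1.0,
             noise_fraction=0.0, seed=0):
    """Return a data matrix: rows are times, columns are mass channels.

    separation is the distance between peak maxima in units of sigma.
    noise_fraction scales uniform noise on [-1, 1] (assumed noise model).
    """
    mid = n_points / 2.0
    peak_1 = gaussian_peak(n_points, mid - 0.5 * separation * sigma, sigma)
    peak_2 = gaussian_peak(n_points, mid + 0.5 * separation * sigma, sigma, rel_height)
    rng = random.Random(seed)
    data = []
    for p1, p2 in zip(peak_1, peak_2):
        # Channel-by-channel sum of the two contributing species.
        row = [p1 * s1 + p2 * s2 for s1, s2 in zip(SPECTRUM_1, SPECTRUM_2)]
        if noise_fraction > 0.0:
            row = [x + noise_fraction * rng.uniform(-1.0, 1.0) for x in row]
        data.append(row)
    return data

clean = simulate(separation=6.0)
noisy = simulate(separation=6.0, noise_fraction=0.1, seed=1)
```

Changing `separation`, `rel_height`, or the two widths (by calling `gaussian_peak` with different `sigma` values) gives the kinds of cases tabulated in sections A to D of Table I.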

in principle, be extended to mixtures involving unknown species. In a similar situation, Klein et al. (8) used probit analysis which gives highly accurate information on retention time and standard deviations. However, probit analysis tends to fail for non-Gaussian curves. In addition, probit analysis does not show the extent of correlation between data from different detector channels, and one must do a least-squares fit for each detector channel. On the other hand, PCA makes no assumption about curve shape, the number of components, or their spectra. Furthermore, because it derives correlations between channels, it reduces the number of parameters. Finally, PCA provides a relatively rapid way of determining how many components are present in a single chromatographic peak.

EXPERIMENTAL

Reagents. Two sources of carbon dioxide were used. The first was "bone dry" tank gas from Matheson Gas Products. The second source was carbon dioxide, 90% enriched in carbon-13, from the Mound Laboratory of Monsanto Research Corp. The n-hexane and n-heptane were 99% pure from Phillips Petroleum Co.

(8) P. D. Klein, D. W. Simborg, and P. A. Szczepanik, Pure Appl. Chem., 8, 357 (1964).


In Equation 1, $S_t$ is the smoothed value (running sum), $Y_t$ is the current point to be smoothed, and $\alpha$ is the weight assigned to the current point (11). By using an $\alpha$ of 0.1, the resulting curve shape was the same as that obtained from Equation 3 of Gladney et al. (12).

Real mass spectrometric measurements were performed using two different systems. For carbon dioxide, masses 44 and 45 were measured. For mixtures of n-hexane and n-heptane, masses 42, 43, 56, 57, 70, and 71 were selected because they were present in the spectra of both hydrocarbons. Note that masses 86 and 100 for the singly-charged molecular ions were purposely not included so as to provide a more stringent test for the method.
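The single exponential smooth used to produce tailing is short enough to write out. The sketch below is a minimal illustration, assuming the running sum starts at zero before the first point (the text does not state the starting value); the function name and the example peak are hypothetical.

```python
import math

def exponential_smooth(y, alpha, s0=0.0):
    """Apply S_t = (1 - alpha) * S_(t-1) + alpha * Y_t to a sequence.

    s0 is the assumed value of the running sum before the first point.
    """
    smoothed = []
    s = s0
    for value in y:
        s = (1.0 - alpha) * s + alpha * value
        smoothed.append(s)
    return smoothed

# A symmetric Gaussian peak (sigma = 6 points, maximum at point 50) ...
peak = [math.exp(-0.5 * ((t - 50) / 6.0) ** 2) for t in range(200)]
# ... becomes a lower, later, tailed peak after smoothing with alpha = 0.1.
tailed = exponential_smooth(peak, 0.1)
```

With a weight of 1 the function returns its input unchanged; the smaller the weight, the longer the tail, because each output point retains a larger share of the preceding running sum.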

(9) J. E. Davis and L. B. Rogers, "Real-time overlay for 1-8 user BASIC-11," DECUS Program Library, No. 11-95 (1973).
(10) J. L. Simonds, J. Opt. Soc. Amer., 53, 968 (1963).
(11) R. G. Brown, "Smoothing, Forecasting and Prediction of Discrete Time Series," Prentice-Hall, Englewood Cliffs, N.J., 1963, p 132.
(12) H. M. Gladney, B. F. Dowden, and J. D. Swalen, Anal. Chem., 41, 883 (1969).

RESULTS

Simulated Data. The effects of differences in peak separation, peak widths, peak heights, and tailing have been examined separately, first for noise-free Gaussians and then in the presence of 10% noise.

The extent of peak separation was found to have an effect on both the percentage of the trace of the matrix accounted for by the first vector and on the shape of the resulting scalar plot. Section A of Table I shows that, as the peak maxima moved closer together, the relative size of the first eigenvalue increased, i.e., the percentage of the trace associated with the first vector increased. Figure 1 shows that, at a separation of 6 σ, there were widely separated, symmetrical, and clearly delineated rays produced in the plot. Had there been only a single component present, PCA calculations would have reported only one vector and, in the plot, only a single ray, i.e., a line, would have been seen coming from the origin, 0. At a modest separation of 2 σ, where approximately a 20% valley would be found in a chromatogram, two lobes were obtained that had intermediate points close to the origin. As the separation decreased further, the intermediate points moved outward on the scalar plot until a drop-like shape was approached.

Another viewpoint may be gained by plotting the scalars for the second vector (vertical axis) as a function of time. Curve A of Figure 2 shows that a plateau was formed between the maximum and minimum when there was a "base line" separation (6 σ) of the original peaks. For smaller separations (Curve B), the number of points between the maximum and the minimum on the scalar plot did not appear to bear any simple relationship to the extent of separation.

Figure 1. Plot of scalars for various separations of peak maxima. The peaks were Gaussians of equal height and width. (A) 6 σ separation, (B) 2 σ, (C) 1 σ, (D) 0.16 σ

Figure 2. Scalars for second eigenvector vs. time. The peaks were Gaussians of equal heights and widths. (A) 4 σ separation, (B) 0.16 σ

Section B of Table I shows that the presence of noise decreased the size of the first eigenvalue and increased the percentage of the trace that remained unassigned. At the same time, estimation of the eigenvectors required more iterations, going from an average of about 3 to about 5. The presence of noise also caused scatter in the plotted points, as shown in Figure 3, but it did not prevent easy detection of the second component from the shape of the plot, even at a separation of 0.41 σ. Furthermore, the spread of the cluster of points near the origin permitted one to estimate visually the magnitudes of the noise along X and Y axes. If the program had found a second vector, which was actually noise rather than a second chemical species, a single ray having a wide scatter of noise, especially around the origin, would then be seen in the scalar plot.

Figure 3. Effect of noise on the scalar plot for Gaussian peaks of equal heights and widths at 0.41 σ separation. The solid line is for noise-free data

The effect of different peak heights, at a constant separation of 2 σ, is shown in section C of Table I. As the ratio of the heights changed from unity, the first eigenvalue increased. Figure 4 shows that there was an apparent rotation of the limiting rays and an increase in the angle between them. Although the effects of noise have not been shown, the changes in the first vector and in the percentage of trace that remained unassigned were similar to those reported in Section B of Table I.

Figure 4. Effect of peak height on the scalar plot for Gaussian peaks of equal widths at 2 σ separation. (A) 1:0.03, (B) 1:0.10, (C) 1:1

Section D of Table I shows a limiting case for the effect of a difference in peak width at zero separation of the maxima. In the presence of 10% noise, a 15% difference in width was detectable, but an 8% difference was not. Furthermore, in the latter case, only one vector was found even though a substantial percentage of the trace remained unassigned. As Figure 5 shows, there were no


Table II. Experimental Results for Binary Mixtures

A. Carbon dioxide

Sep., σ     Vector 1, %    Vector 2, %    Remaining, %
0.022 a     99.8           0.2            2 × 10^-4
0.022 b     99.8           0.2            2 × 10^-4

B. n-Hexane-n-heptane, 4:5 mixture (by volume)

Sep., σ             Vector 1, %    Vector 2, %    Remaining, %
3.28 (100 °C)       97.8           2.1            7 × 10^[illegible]
1.88 (125 °C)       98.8           1.2            7 × 10^[illegible]
c (135 °C)          98.1           1.8            1 × 10^-1
d (150 °C)          98.6           1.3            8 × [illegible]
d (175 °C)          98.8           1.1            1 × 10^[illegible]

a 1:1 mixture (by volume). b Natural abundance. c Shoulder for n-hexane on larger peak for n-heptane. d Single peak.

Figure 5. Scalar plot for Gaussian peaks of equal heights and zero separation but different widths. The solid line is for noise-free data

Figure 6. Effect of peak separation on the scalar plot of a Gaussian and a skewed Gaussian (α = 0.1) having equal heights and widths (before skewing). (A) 0.82 σ separation, (B) 0.41 σ separation

longer two distinct lobes, but, in the noise-free case, one can see that there was a single curved line. This was due to the fact that the centers of symmetry for the peaks were the same, and there was no peak separation to produce an angle between the limiting rays. This can be grasped more easily by going to Figure 6 and section E of Table I where the effect of tailing on the first chromatographic peak can be observed. Note that the scalar plot no longer retraced itself. Furthermore, as section E shows, the presence of tailing on both peaks made easier the detection of the second component in the sense that the percentage for the second vector was larger for a given value of peak separation.

Real Data. The separation of the isotopic carbon species of carbon dioxide was a trivial example for two reasons. First, the electron-multiplier signals for masses 44 and 45 were clearly independent of one another. Second, because only two masses were measured, there could be no reduction in the number of vectors necessary to reproduce the original data. However, it did serve as a convenient check on some of the conclusions that resulted from the simulations, and it also illustrated how to minimize an instrumental limitation of the detection system when looking for a trace component.

Because of the very small separation of the peaks, a very large portion of the trace was accounted for by the first eigenvalue as shown for the 1:1 mixture in section A of Table II. However, when samples that contained the natural abundances were measured using the same amplifier sensitivity for both masses, only one vector was usually detected. By increasing 100-fold the sensitivity of the amplifier when measuring mass 45, the signal-to-noise was improved substantially so that a second vector was always detected. As shown in Table II, the rounded percentages for the two vectors were identical with those for the 1:1 mixture. Figure 7 shows how easy it was to detect the second vector and how little confusion the noise introduced.

Figure 7. Scalar plot for natural-abundance carbon dioxide

The hexane-heptane case was a more realistic one. As pointed out earlier, measurements were not made of the molecular ions so as to eliminate unique masses. In addition, a relatively short, lightly loaded column was used at elevated temperatures so as to ensure bad overlap of the components. The scalar plots are not shown because, in the worst case (175 °C), the plot very closely resembled that in Figure 7. Hence, the second component was easily detected. However, one aspect to note in Table II is that the percentage of the trace represented by the first vector was high, and it did not change much. The fact that it did decrease at intermediate values of σ appears to be real, and it has been attributed to tailing of the peaks because of a similar trend in section E of Table I.
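The scalar plots in this study were judged by eye: a single component gives one ray from the origin, whereas two components give lobes or rays at different angles. As a hypothetical numerical companion to that visual test (not something the authors report doing), one could measure the spread of the angles of the plotted points, ignoring the noisy cluster near the origin. The function name, the `min_radius` cutoff, and the example points below are all assumptions made for illustration.

```python
import math

def ray_angle_spread(scalar_pairs, min_radius):
    """Spread, in degrees, of the directions of (scalar 1, scalar 2) points.

    Points closer to the origin than min_radius are skipped, since their
    direction is dominated by noise. Returns 0.0 if no point qualifies.
    """
    # Eigenvector signs are arbitrary; flip so that scalar 1 is mostly positive,
    # which keeps the angles away from the +/-180 degree wrap-around.
    if sum(s1 for s1, _ in scalar_pairs) < 0:
        scalar_pairs = [(-s1, s2) for s1, s2 in scalar_pairs]
    angles = [math.atan2(s2, s1) for s1, s2 in scalar_pairs
              if math.hypot(s1, s2) >= min_radius]
    if not angles:
        return 0.0
    return math.degrees(max(angles) - min(angles))

# One component: every point lies on the same ray from the origin.
single = [(0.1 * k, 0.03 * k) for k in range(1, 11)]

# Two fully separated components: two rays (hypothetical scalar values).
double = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0), (0.5, 0.5),
          (0.5, -0.5), (1.0, -1.0), (0.5, -0.5)]

spread_single = ray_angle_spread(single, min_radius=0.05)
spread_double = ray_angle_spread(double, min_radius=0.05)
```

A spread near zero corresponds to the single-ray picture; any threshold for calling a spread "significant" would have to be set from the noise level, for which the text suggests using the scatter of the points around the origin.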

DISCUSSION

The approach reported in this study should be particularly useful for detecting rather quickly the presence of more than one component in a chromatographic peak that appears visually to represent only one pure species. It is an especially powerful approach because no prior assumption is required concerning the peak shape of any component. Hence, peak distortion from chemical or electronic sources, if consistent, will not interfere. This was proved by performing simulations in which channel-to-channel carry-over was tested in a manner parallel to that used for peak tailing (carry-over within a given channel). Carry-over was found to have no significant effect on the interpretation of the results. Likewise, changes in base line were found to introduce no noticeable obstacle to the recognition of a second component.

The chief anomaly we encountered was in analyzing noisy data where noise sometimes constituted a second vector. However, a plot of the data allowed one to diagnose that situation by a quick visual inspection as described above when discussing the results in section B of Table I. An entire calculation and plot of the data (for a 200 by 5 matrix) can be performed in 3 to 5 minutes even when several iterations are required to find the vectors.

Another advantage of using PCA is that, in principle, the process of selecting the mass channels to be measured is a problem of secondary importance. The input of "data" from channels that have no significant information to contribute will be discarded by the mathematical analysis in the same way as noise.

The current study has been concerned only with the interpretation of the scalars resulting from a principal-component analysis of chromatographic data. The fact that negative values for percentages of components are ruled out makes the interpretation of PCA, as applied to chromatography, straightforward (13). More important, further analysis of the information should be facilitated by using the scalars as a pre-processed data set. Hence, the potential reduction in the number of variables should speed up later analyses either by curve-modeling techniques or by learning-machines. Indeed, when only two eigenvectors remain, it should sometimes be possible to select, by inspection, limiting rays which specify the ratio of the eigenvectors required to generate the mass spectra for the pure components. Having accomplished that, the chromatogram for the completely resolved components could then be constructed.

Received for review September 12, 1973. Accepted February 20, 1974. Supported in part by the U.S. Atomic Energy Commission under Contract AT(11-1)-1222. Presented at the Pittsburgh Conference on Analytical Chemistry and Applied Spectroscopy, March 9, 1973.

(13) W. H. Lawton and E. A. Sylvestre, Technometrics, 13, 617 (1971).

Chromatographic Separation of Metal Ions on Low Capacity, Macroreticular Resins

James S. Fritz and James N. Story

Ames Laboratory-USAEC and Department of Chemistry, Iowa State University, Ames, Iowa 50010

Forced-flow chromatography on partially sulfonated, macroreticular resin beads is used to obtain several rapid metal ion separations. Separations of thorium(IV) from lanthanum(III); thorium(IV) from zirconium(IV); calcium(II) from magnesium(II); zinc(II), lead(II), copper(II), manganese(II), and nickel(II) from each other; and a separation of lead(II) from large amounts of other divalent ions are demonstrated. Low-capacity resins provide rapid separations in strongly acidic eluents of moderate concentrations. In-stream addition of color-forming reagent provides continuous detection and accurate quantitation of eluted metals.

The preparation and characterization of a series of low-capacity partially-sulfonated cation-exchange resins has been described in another paper (1). Macroporous highly cross-linked polystyrene resins were sulfonated at temperatures ranging from 2 to 175 °C for varying lengths of time. The capacity of the resins ranged from 0.23 to 3.70 mequiv/gram. Distribution coefficients of metal ions with these resins are significantly different from those with conventional ion-exchange resins. In some cases, the separation factor for two metal ions is considerably more favorable with the lower capacity resins. The excellent stability and physical properties of the partially sulfonated resins make them desirable for use in forced-flow chromatography.

(1) J. S. Fritz and J. N. Story, J. Chromatogr., submitted.

Recently, the application of high pressure chromatographic techniques to ion-exchange separation of metal ions has resulted in greatly reduced analysis times (2-6). High mobile phase velocities, small particle sizes, and in-stream detection have combined to permit the separation and analysis of many different metal ions in 5 minutes or less. In this paper, the new lower capacity cation-exchange resins are used for a number of rapid chromatographic separations of metal ion mixtures. Only a few, selected separations are reported. The intent of this work is to demonstrate the usefulness and, in some cases, the advantages of separations using the new ion-exchange resins in forced-flow chromatography.

EXPERIMENTAL

Apparatus. The chromatograph used in this work is outlined in Figure 1. All valves, fittings, and columns were either purchased from Chromatronix, Inc. of Berkeley, Calif., or were machined from raw materials. All connecting tubing in the eluent system was 0.031-in. i.d. Teflon tubing. The chromatograph shown was designed to allow only glass, Teflon, or Kel-F plastic to come into contact with the mobile phase. The eluent tank was a simple glass bottle or beaker containing the eluent. The mobile phase was pumped from the eluent tank

(2) Mark D. Seymour, John P. Sickafoose, and James S. Fritz, Anal. Chem., 43, 1734 (1971).
(3) M. D. Seymour and J. S. Fritz, Anal. Chem., 45, 1394 (1973).
(4) M. D. Seymour and J. S. Fritz, Anal. Chem., 46, in press.
(5) James S. Fritz and John P. Sickafoose, Talanta, 19, 1573 (1973).
(6) Kazuyoshi Kawazu and James S. Fritz, J. Chromatogr., 77, 397 (1973).