Anal. Chem. 1985, 57, 704-709



Comparison of Canonical Variates Analysis with Target Rotation and Least-Squares Regression as Applied to Pyrolysis Mass Spectra of Simple Biochemical Mixtures

Lloyd V. Vallis and Halliday J. MacFie*
Agricultural Research Council, Meat Research Institute, Langford, Bristol BS18 7DY, United Kingdom

Colin S. Gutteridge
Cadbury Schweppes PLC, Group Research, Lord Zuckerman Research Centre, The University, Whiteknights, Reading, Berks, United Kingdom

The estimation of the number, identity, and proportions of biochemical compounds present in mixtures is often carried out by target transformation. We have previously shown this to be equivalent to projection in a principal component space. In this paper we derive an analogous method, based on canonical variates analysis, that uses the replicate information from the samples. The superior performance of this approach is demonstrated by using pyrolysis mass spectra of mixtures of glycogen, dextran, and bovine serum albumin.

The estimation of constituents in mixtures from mixture spectra of various kinds is an important problem in analytical chemistry. A range of statistical techniques has been used for extracting information from mixture spectra, including regression analysis, generalized spectral subtraction, principal components analysis (sometimes called principal factor analysis), and canonical variates analysis (discriminant analysis). Principal components analysis (PCA) is the method most commonly used. It appears in the work of Lawton and Sylvestre (1) on the deconvolution of mixture spectra, in target transformation (2, 3), which has been shown to be equivalent to projection onto principal component vectors (4), and in the graphical rotation method of Windig et al. (5).

When replicate mixture spectra are available it may be possible to obtain improved results by using a method which maximizes differences between mixtures in relation to differences within mixtures (6). The authors of ref 6 combine principal components analysis, discriminant analysis, and graphical rotation to interpret sets of pyrolysis mass spectra from biological materials. The technique they call discriminant analysis is often called canonical variates analysis (CVA) in statistical texts (7-10).

In this paper we describe the CVA analogue of projection onto PCA vectors (target transformation). A comprehensive set of pyrolysis mass spectra from mixtures of three biological compounds is analyzed by both methods. Some results from ordinary least-squares regression analysis are also included for comparison.

© 1985 American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 57, NO. 3, MARCH 1985

THEORY

Mixture Model. A basic assumption in the analysis is that the constituent spectra combine linearly to produce a mixture spectrum (Beer's law). Expressed mathematically

$$ y = Xb + e \qquad (1) $$

where y is a p × 1 vector of observable spectral values, X is a p × k matrix with known constituent spectra as its columns, b is a k × 1 vector of unknown proportions, and e is a p × 1 vector of random errors with mean 0. The elements of y, X, and b are nonnegative, and the elements of b sum to 1. The observed spectra may contain purely relative intensities, so that only the shape of a spectrum is significant. It is then usual to normalize the spectra in some way, e.g., by scaling so that the sum of the intensities in a spectrum is a constant.

Additional constraints may be imposed on X or b in special cases. For example, the constituent spectra x_j, j = 1 to k, may be assumed to have shapes defined by known shape functions with few unknown parameters (11). In the case of a set of mixtures y_j, j = 1 to n, there may be a parametric model relating the unknown proportions b_j to some known experimental variables such as time or initial proportions (12).

We consider the case of a set of related mixtures of a few constituents in which there is no mathematical model relating the constituent proportions to an "external variable" such as time. It is assumed that reference spectra for a number of likely constituents are available. The aim of the data analysis is to find the number, identity, and proportions of constituents in the set of mixtures.

CVA (1) Data of Full Rank. This is the ordinary CVA described in textbooks. By full rank data we mean that the within-mixtures sample covariance matrix has an inverse. This implies that

$$ (n - g) \ge p \qquad (2) $$

where n p-dimensional observations (spectra) have been obtained from g mixtures. Rarely, collinearity of observations may cause singularity of the sample covariance matrix (no inverse exists) even when condition (2) is satisfied. We now proceed to show how CVA of mixture data may be used in conjunction with reference spectra of likely constituents to determine the number, identity, and proportions of constituents in a set of mixtures.

CVA (2) Data of Less Than Full Rank. In the context of CVA, data of less than full rank are such that the within-groups (e.g., within-mixtures) sample covariance matrix is singular. This implies that

$$ (n - g) < p \qquad (3) $$

or that at least one linear relationship among the components of the data vector holds for every data unit. In the rest of this paper we assume that singularity is due solely to condition (3). The problem may be avoided by preliminary data reduction to reduce p so that (n - g) ≥ p. Several methods of data reduction are available. The one which we use here is principal components analysis (PCA). The basic idea is that most of the variation in the data can be retained in a relatively small number of principal components (PCs), which are then subjected to CVA. The canonical variates from the reduced data are used in the same way as canonical variates from data of full rank in the mixture analysis. The principal components are obtained from the total covariance matrix. When p > (n - 1) a worthwhile saving in computer memory may be made by extracting principal components by a principal coordinates analysis (13) of the n × n association matrix.

The question arises as to how many PCs to use in the subsequent CVA. The maximum number one can use is (n - g). Clearly one wants to avoid using very small PCs, which could receive spuriously large weights in the CVA due to rounding errors. Windig et al. (6), working with correlation matrices of 100 variables, discard the principal components corresponding to latent roots less than one. Hoogerbrugge et al. (14) state, without explanation, that the number of PCs retained has to be less than a quarter of the number of spectra in the data set. They routinely select PCs which account for more than 1% of the total variance. We have also found this selection rule satisfactory to date for mixture analysis, but it is unlikely to be satisfactory as a general rule.

Estimating the Number of Constituents. This estimation is based on examination of the latent roots λ_j obtained as solutions to the equations

$$ B l_j = \lambda_j W l_j, \qquad j = 1, \ldots, t \qquad (4) $$

in which B and W are symmetric p × p matrices of sums of squares and products (SSP matrices) for variation between mixture spectra and between replicate spectra within mixtures, respectively, and t = min(p, g - 1) is the maximum number of nonzero λ_j. Here l_j is the latent vector associated with λ_j. Variation in the proportions of k constituents in the mixtures should produce k - 1 major roots and a set of minor roots which are separated from the major roots by a distinct change of magnitude. This implies that the mixture means lie close to a (k - 1)-dimensional hyperplane. The residual distances from the hyperplane are due to experimental errors and also to deviations from the mixture model. This examination can be aided by plotting the roots against their ranks. One can also carry out a formal χ² significance test on the equality of the remaining roots after excluding the k - 1 larger roots (15, 16). However, the χ² test assumes large samples, which are rarely available in practice.

Identifying Constituents. If a (k - 1)-dimensional subspace (hyperplane) has been found for the mixture set, one can test for the presence of a potential constituent by looking at the position of the canonical variate transformation of its spectrum. Potential constituents can be tested independently in this way. Two criteria should be satisfied.

(1) Canonical variate residuals of constituents should be comparable with those of the mixtures. Canonical variate residuals are simply the canonical variate scores on all but the first k - 1 canonical axes. Suppose c_2 is the (t - k + 1)-dimensional vector of canonical variate residuals for a potential constituent, with elements c_i, i = k to t, where t = min(p, g - 1) is the full dimensionality of the canonical space, and the g columns of the matrix Z_2, with elements z_ij, are the corresponding canonical residuals for the mixture means. Residual mean squares can be defined for the mixtures and a constituent respectively as

$$ s_M^2 = \frac{\operatorname{trace}(Z_2 Z_2')}{(t - k + 1)(g - k)} = \frac{\sum_{i=k}^{t} \sum_{j=1}^{g} z_{ij}^2}{(t - k + 1)(g - k)} \qquad (5) $$

and

$$ s_C^2 = \frac{\operatorname{trace}(c_2 c_2')}{t - k + 1} = \frac{\sum_{i=k}^{t} c_i^2}{t - k + 1} \qquad (6) $$

Then the ratio

$$ F = s_C^2 / s_M^2 \qquad (7) $$

may be used in an approximate F test with (t - k + 1) and (t - k + 1)(g - k) degrees of freedom to judge a potential constituent. The foregoing method is analogous to the use of residuals in the SIMCA method (17) for classifying test objects.

(2) The canonical variate transformations of constituent spectra should form a region enclosing the mixtures in the mixture subspace. This can be checked graphically if the mixtures consist of only a few constituents. Alternatively, the distance of a test spectrum from the mixtures mean can be compared with that of its nearest neighbors. These distances are computed using only the first k - 1 canonical variate axes.
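The latent-root step and the residual test of criterion (1) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function names `cva` and `residual_f_ratio`, the synthetic three-constituent spectra, the noise level, the "foreign" test spectrum, and the choice to divide W by (n - g) before solving are all hypothetical. The F ratio itself is unaffected by that scaling, because it rescales every canonical axis by the same factor.

```python
import numpy as np

def cva(spectra, labels):
    """Solve B l = lambda W l for full-rank data (n - g >= p)."""
    spectra = np.asarray(spectra, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n, p = spectra.shape
    origin = spectra.mean(axis=0)                 # mean over all spectra
    means = np.array([spectra[labels == g].mean(axis=0) for g in groups])
    counts = np.array([(labels == g).sum() for g in groups])
    dev = means - origin
    B = (dev * counts[:, None]).T @ dev           # between-mixtures SSP matrix
    within = spectra - means[np.searchsorted(groups, labels)]
    W = within.T @ within / (n - len(groups))     # assumed scaling: pooled covariance
    R = np.linalg.cholesky(W)                     # W = R R'
    R_inv = np.linalg.inv(R)
    roots, vecs = np.linalg.eigh(R_inv @ B @ R_inv.T)
    order = np.argsort(roots)[::-1]               # largest latent root first
    return roots[order], R_inv.T @ vecs[:, order], means, origin

def residual_f_ratio(latent_vectors, means, origin, candidate, k):
    """Approximate F ratio s_C^2 / s_M^2 for one candidate spectrum."""
    g, p = means.shape
    t = min(p, g - 1)
    resid_axes = latent_vectors[:, k - 1:t]       # canonical axes k, ..., t
    Z2 = (means - origin) @ resid_axes            # residual scores of mixture means
    c2 = (np.asarray(candidate, dtype=float) - origin) @ resid_axes
    dof1 = t - k + 1
    dof2 = dof1 * (g - k)
    s_m2 = np.sum(Z2 ** 2) / dof2
    s_c2 = np.sum(c2 ** 2) / dof1
    return s_c2 / s_m2, (dof1, dof2)

# Synthetic data (hypothetical): k = 3 constituents, p = 6 channels.
X = np.array([[5, 1, 0], [3, 1, 1], [1, 4, 0],
              [1, 3, 2], [0, 1, 4], [0, 0, 3]]) / 10.0
steps = 4
props = np.array([(i, j, steps - i - j)
                  for i in range(steps + 1)
                  for j in range(steps + 1 - i)]) / steps   # 15 mixtures
replicates = 4
rng = np.random.default_rng(0)
clean = np.repeat(props @ X.T, replicates, axis=0)
spectra = clean + rng.normal(scale=0.01, size=clean.shape)
labels = np.repeat(np.arange(len(props)), replicates)

roots, L, means, origin = cva(spectra, labels)
f_true, dof = residual_f_ratio(L, means, origin, X[:, 0], k=3)
foreign = np.array([0.1, 0.4, 0.1, 0.0, 0.0, 0.4])   # not a constituent
f_foreign, _ = residual_f_ratio(L, means, origin, foreign, k=3)
print(np.round(100 * roots / roots.sum(), 1), dof)
print(f_true, f_foreign)
```

With three constituents the first two latent roots should dominate the rest, and the spectrum that is not a constituent should give a far larger F ratio than the genuine one. The whitening step via a Cholesky factor is only one way of solving the symmetric generalized eigenproblem; it was chosen here to keep the sketch within NumPy.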


Table I. Latent Roots from Principal Components and Canonical Variates Analysis of 63 Mixtures of GLY, DEX, and BSA

Latent roots 1 to 8 and the remainder are given as % of total.

| method | data analyzed | no. of variables | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | remainder |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PCA | mean spectra | 104 | 62.1 | 17.4 | 10.3 | 1.7 | 1.4 | 1.1 | 0.8 | 0.7 | 4.5 |
| PCA | replicate spectra (ungrouped) | 104 | 44.0 | 27.9 | 9.0 | 3.2 | 2.5 | 1.7 | 1.5 | 1.1 | 9.1 |
| CVA | first 8 PCs of ungrouped replicate spectra | 8 | 79.2 | 17.5 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.0 |
| CVA | replicate spectra | 104 | 71.5 | 9.2 | 3.7 | 2.3 | 1.8 | 1.4 | 1.1 | 1.0 | 8.0 |
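As a small worked example of the 1% selection rule, the sketch below applies it to the Table I row for PCA of the ungrouped replicate spectra and also evaluates the upper limit (n - g) on the number of PCs, taking three replicates of each of the 63 mixtures. The function name `select_pcs` is hypothetical, and the sketch assumes that each root lumped into the "remainder" column falls below 1%, which the table itself does not show.

```python
# Latent roots (% of total variance) from Table I, PCA of the
# ungrouped replicate spectra; further roots are lumped as "remainder".
roots_pct = [44.0, 27.9, 9.0, 3.2, 2.5, 1.7, 1.5, 1.1]
remainder_pct = 9.1

def select_pcs(roots_pct, threshold_pct=1.0):
    """Keep PCs whose latent root exceeds threshold_pct of the total variance."""
    kept = [r for r in roots_pct if r > threshold_pct]
    return len(kept), sum(kept)

n_spectra, n_mixtures = 3 * 63, 63        # three replicates of 63 mixtures
max_pcs = n_spectra - n_mixtures          # upper limit (n - g) for the CVA
n_kept, retained_pct = select_pcs(roots_pct)
print(n_kept, round(retained_pct, 1), max_pcs)   # prints: 8 90.9 126
```

Under that assumption the rule keeps eight PCs, which matches the "first 8 PCs" row of Table I and lies well below the limit of 126.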

This implies that the full canonical variate test vector should form a reasonably small angle with the mixture hyperplane. Note that the vectors are measured from an origin taken at the mixtures mean. At present we do not have an objective criterion for the angle but suggest that a test vector forming an angle of cosine less than 0.9 with the mixture hyperplane is a strong candidate for rejection. In practice both the squared distance S² and the cosine criteria would be examined.

Estimating Proportions. Once the constituents in the set of mixtures are identified, the proportions may be estimated from the canonical variate transformation of eq 1, using only the first k - 1 canonical variates, contained in the p × (k - 1) matrix L_1. That is

$$ L_1' y = L_1' (Xb + e) \qquad (10) $$

and by obvious substitutions

$$ z_1 = C_1 b + h_1 \qquad (11) $$

where the subscript 1 in (10) and (11) means that the subscripted matrix or vector is the partition corresponding to the first k - 1 canonical variates. Thus C_1 is a (k - 1) × k matrix. We can augment eq 11 with the constraint that b_1 + ... + b_k = 1, which can be written in matrix form as

$$ 1'b = 1 \qquad (12) $$

Hence

$$ \begin{pmatrix} z_1 \\ 1 \end{pmatrix} = \begin{pmatrix} C_1 \\ 1' \end{pmatrix} b + \begin{pmatrix} h_1 \\ 0 \end{pmatrix} \qquad (13) $$

and thus the estimates of the proportions are

$$ \hat{b} = \begin{pmatrix} C_1 \\ 1' \end{pmatrix}^{-1} \begin{pmatrix} z_1 \\ 1 \end{pmatrix} \qquad (14) $$

$$ \hat{b} = A \begin{pmatrix} z_1 \\ 1 \end{pmatrix} \qquad (15) $$

where A is the inverse matrix in eq 14. A covariance matrix (conditional on A) for the estimate of b is

$$ \operatorname{cov}(\hat{b}) = (1/r)\, A \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix} A' \qquad (17) $$

where I is a (k - 1)-dimensional identity matrix and r is the number of replications taken on the mixture. In the Appendix we show that (17) is equivalent to using eq 14 to estimate b for individual replicates and computing a pooled sample covariance matrix from the replicate estimates.

EXPERIMENTAL SECTION

Materials. The mixture constituents were the polyhexoses glycogen (GLY) and dextran (DEX) and a protein, bovine serum albumin (BSA), all of which were obtained from the Sigma Chemical Co. Concentrated suspensions of each compound in water were prepared prior to mixing at the correct concentration. Sixty-three mixtures were prepared by using proportions of constituents as illustrated in Figure 1. The mixtures were pipetted onto clean Curie-point wires and the water was evaporated to leave a constant sample size of 50 μg.

Figure 1. Mixture triangle showing the proportions of constituents in 63 experimental mixtures. The proportions vary in steps of 10%.

Pyrolysis Mass Spectrometry. Spectra were obtained with a Curie-point pyrolysis mass spectrometer (Pyromass 8-80, VG Gas Analysis, Ltd., Middlewich, U.K.). This instrument is similar in concept to that designed by Meuzelaar et al. (18), except that it utilizes a magnetic mass analyzer. A full description of the Pyromass 8-80 can be found in Shute et al. (19). The key conditions used in this study were as follows: pyrolysis, 510 °C for 2 s; electron energy, 16 eV; scan time, 1.3 s/cycle; number of scans, 33; mass range, m/z 12-300. Mass spectra of three analyses of each pure constituent and mixture were recorded, and the mass intensity data over the range m/z 51-154 were used for the statistical analyses.

RESULTS AND DISCUSSION

Since there are three replicates of the 63 mixtures and 104 variables (n - g = 189 - 63 = 126 ≥ p = 104), the within-mixtures dispersion matrix will be nonsingular unless some variables are collinear. The latter is found not to be the case, and therefore a CVA is possible. However, we present results from combined PC-CVA for comparison.

Estimates of the Number of Constituents. As explained before, the number of constituents with varying concentrations in the mixtures is one more than the number of major latent roots. Therefore the two major latent roots in Table I from CVA of 104 mass intensities indicate the presence of three constituents. In the case of CVA of the first eight principal components, accounting for 90.9% of total variance, the presence of three constituents is shown even more clearly. By contrast, the three major latent roots from PCA of the mean spectra indicate the presence of a fourth constituent or failure of the mixture model (eq 1). Van de Meent et al. (20) have suggested that an extra component could occur because of chemical interaction between two or more constituents. This would be indicated by systematic change of the extra component values with changing concentrations of one or more constituents. No such systematic variation was observed and we conclude that the third PC is spurious and caused by giving equal weight to variables that are not associated with large


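Table II. Statistics for the Presence of Constituents in Mixtures of GLY, DEX, and BSA

| mixture analysis method | constituent | residual mean square, constituent | d.f. | residual mean square, mixtures | d.f. | F ratio | F prob | cosine of angle^b between test vector and mixture plane |
|---|---|---|---|---|---|---|---|---|
| CVA | GLY | 8.53 | 60 | 10.800 | 3600 | 0.79 | P > 0.10 | |
| CVA | DEX | 14.46 | 60 | | | 1.34 | P = 0.04 | |
| CVA | BSA | 46.59 | 60 | | | 4.31 | P | |
| PC-CVA^a | GLY | 4.902 | 6 | 1.169 | 372 | | | |
| PC-CVA^a | DEX | 2.685 | 6 | | | | | |
| PC-CVA^a | BSA | 1.959 | 6 | | | | | |
| PCA | GLY | 0.2254 | 60 | 0.07028 | | | | |
| PCA | DEX | 0.0865 | 60 | | | | | |
| PCA | BSA | 0.1143 | 60 | | | | | |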