Assessing Mixing Models within a Common Framework

Jun 26, 1996
Critical Review

MARI-ANN AKERJORD† AND NILS CHRISTOPHERSEN*
Department of Informatics, University of Oslo, P.O. Box 1080, Blindern, N-0316 Oslo, Norway

The problem of explaining a set of observations as mixtures of certain sources or end-members occurs in many fields of science. An array of different data analysis techniques has been developed to tackle such problems within the various fields of application. Mathematically, mixing problems are similar, but the common framework is not always apparent because of the different assumptions and terminology specific to each discipline. Here, an attempt is made to present the common basis geometrically within a linear least-squares setting. The focus is on comparing and assessing models and techniques, including subjective and potentially ill-posed elements of the analysis. Careful experimental design and the use of computer experiments are recommended as part of a rational approach to mixing problems.

* Corresponding author fax: +47 22852401; e-mail address: [email protected].
† Present address: Nycomed Imaging, P.O. Box 4220, Torshov, N-0401 Oslo, Norway.
S0013-936X(94)00672-3 CCC: $12.00 © 1996 American Chemical Society

1. Introduction

In mixing problems, one tries to explain given samples as mixtures of certain sources or end-members. Such problems occur in many fields of science including: (1) Chromatography and spectroscopy, where the end-members are the unknown chemical constituents and the observations from the mixture samples are the concentrations at certain retention times and absorbances at selected wavelengths, respectively (1-6). (2) Geology, where the end-members are parent minerals and the observations are chemical constituents of the rock samples (7-14). (3) Air quality studies, where the goal is to apportion pollution among potential sources based on measurements of the polluting species (15-24). (4) Hydrochemical studies of natural catchments, where the end-members are the soil water in chemically distinct soil horizons. The observations are concentrations of chemical species in streamwater, and the goal is to estimate the amount of water contributed by each soil horizon to the stream (25-27).

In the first three areas, one seeks the mass or the concentration contributed by each end-member, while the problem in hydrochemistry is to estimate the amount of water (i.e., carrier) from the end-members. We denote the former problems as content-oriented and the latter as carrier-oriented. The measurements or observations from the mixtures will be denoted variables. These are usually assumed to be sums of the corresponding end-member variables, appropriately scaled according to the end-member contributions. (The mixing is conservative: no reactions occur.)

Mathematically, mixing problems are closely related, but in practice they may be rather different, depending on the information available about the end-members. The simplest situation occurs when the end-members are known or assumed and the problem is to estimate their contributions to each observed mixture in the presence of experimental or other uncertainty. This is the direct analysis. [In air pollution studies, one uses the term chemical mass balance models (20).] At the next level, both the end-members and their contributions are sought from the mixtures themselves, under certain assumptions and constraints. This is the inverse problem.

There is a need for methods treating data at both levels, and an array of different multivariate data analysis techniques, including principal component analysis and factor analysis, is in use. The methods may be difficult to relate across disciplines because of the different terminology and assumptions employed. Furthermore, mixing problems frequently involve subjective decisions. As a consequence, the literature on mixing is difficult to penetrate. Comprehensive treatments do exist within air pollution (15, 18, 20), within chromatography/spectroscopy (4), and for geology (9, 12), but there still seems to be a need for an interdisciplinary overview emphasizing the common framework and potential pitfalls.
Such an exposition is possible within a geometrical setting, considering mixing models as a set of linear least-squares problems. Inspired by Renner (12), an attempt is made here to provide a conceptually unified treatment of mixing models. The approach also allows one to draw on results from other areas employing linear least-squares methods. This point is illustrated by noting that methods for treating observational errors, originally stemming from statistical signal processing, are useful also in the mixing context. Several practical aspects of mixing analyses are also treated, including the use of computer experiments to assess uncertainties in the calculations. However, no attempt is made to provide a "mixing manual"; the emphasis is on the geometrical approach, allowing a unified treatment. Hopefully, this will contribute to a better understanding of mixing models and lead to improved data analysis techniques in some areas. Some knowledge of linear algebra is helpful, but the main points are discussed in an intuitive geometrical fashion. To fix ideas, synthetic air pollution data from Henry and Kim (19) are used to illustrate salient points.

VOL. 30, NO. 7, 1996 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

FIGURE 1. Synthetic air pollution data taken from ref 19. (a) Profiles of the end-members representing sea salts, soil dust, and auto emissions, respectively. The 10 variables (i.e., chemical elements) are the mass fractions of each variable in each end-member. (b) One mixture profile is shown as a stacked bar diagram indicating the contributions (here 3 µg/m³) from each end-member.

2. Mixing Models

A natural starting point is to consider the end-members and mixtures as profiles (or spectra) by plotting the variables in succession. Figure 1 shows the end-members and one mixture for the air pollution data of Henry and Kim (19), where the variables are chemical elements. The three end-members represent marine sources, soil dust, and auto emissions, respectively. The source compositions are given on a dry weight mass basis, i.e., microgram of chemical element/microgram mass of end-member. Since not all the chemical elements in each end-member are measured, the sum of the variables is less than 1. For the mixtures, the variables are given in microgram/cubic meter (µg/m³), implying that the end-member contributions are in terms of microgram of end-member per cubic meter of air.

Considering a profile as a vector by letting each variable correspond to a vector element, a mixture is obtained from the end-members by scalar multiplications (i.e., multiplying each end-member by its contribution) and element-by-element vector additions. A mixture is therefore a linear combination of the end-members, which implies that the mixture and the end-members are linearly dependent. [Since the mixtures are obtained by adding the end-members after multiplying them by non-negative numbers (the contributions), the mixtures are also so-called convex combinations of the end-members.] All other conservative mixing models, whether stemming from chromatography, geology, or hydrochemistry, may be viewed in terms of profiles and interpreted in this way.

It is useful to emphasize two rather obvious constraints, sometimes called the natural constraints, common to all mixing models: (1) All end-members have only non-negative variables. (2) All contributions to a given mixture are non-negative. In the chemometrical literature, the end-member variables are usually called loadings whereas the end-member contributions are denoted scores.

Redundancy arises if the end-members themselves are linearly dependent. Then a given mixture may be obtained from the end-members in several ways. We normally assume that this is not the case.
By a basic result of linear algebra, the number of end-members can then at most equal the number of variables. In Figure 1, the dominant variables of each end-member occur only in minor amounts in the two others. The end-members are therefore almost orthogonal; strict orthogonality would require no variables in common. Orthogonal or nearly orthogonal end-members are clearly linearly independent, i.e., it is impossible to obtain one by linear combinations of the others. Strictly speaking, linear independence is a question of yes or no, but it is possible to introduce a continuous scale from orthogonality (most independent) to linear dependency. In practice, the end-members have to be "sufficiently" independent. This is to avoid numerical difficulties in the direct situation and to ensure that the true end-members become feasible solutions in the inverse case.

Vector notation is now introduced by assuming p variables and representing a mixture by a p-dimensional vector x. Similarly, if there are k ≤ p end-members, these may be represented by p-dimensional vectors a_j (j = 1, ..., k). Let the coefficients in the linear combination of the end-members (i.e., contributions) that produces x be f_1, ..., f_k. One then has

$$x = \sum_{j=1}^{k} f_j a_j = A f \tag{1}$$

where the columns of the p × k matrix A are the end-members and the coefficients f_j form the vector f.
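Equation 1 can be tried out numerically. The following is a minimal sketch under stated assumptions: the end-member profiles and contributions are hypothetical illustrative numbers (not the Henry and Kim data), and the cosine of the angle between end-members and the condition number of A are used here as two possible choices for the "continuous scale" of independence; the text does not prescribe a particular measure.

```python
import numpy as np

# Hypothetical end-member profiles (columns of A): p = 4 variables, k = 2 end-members.
# The values are illustrative mass fractions, not data from the paper.
A = np.array([
    [0.30, 0.02],
    [0.10, 0.01],
    [0.01, 0.25],
    [0.02, 0.15],
])
f = np.array([3.0, 2.0])  # hypothetical non-negative contributions f_1, f_2

# Equation 1 written as a sum of scaled end-members and as a matrix product.
x_sum = sum(f[j] * A[:, j] for j in range(A.shape[1]))
x = A @ f

# Cosine of the angle between the two end-members:
# 0 means orthogonal, 1 means linearly dependent.
cos_angle = A[:, 0] @ A[:, 1] / (np.linalg.norm(A[:, 0]) * np.linalg.norm(A[:, 1]))

# The 2-norm condition number of A is another continuous measure;
# it grows without bound as the end-members approach linear dependence.
cond = np.linalg.cond(A)

print("mixture x:", x)
print("cosine between end-members:", round(float(cos_angle), 3))
print("condition number of A:", round(float(cond), 3))
```

Both measures are only indicators; what counts as "sufficiently" independent depends on the error level of the data.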

FIGURE 2. Two end-members a_1 and a_2 (k = 2) and a mixture x are visualized as vectors in three-dimensional space (p = 3) where each axis represents one variable. By natural constraint 1, the end-members lie in what corresponds to the positive quadrant in three dimensions (i.e., the positive orthant). By natural constraint 2, the mixture lies in the triangle C, spanned by the end-members.

A mixing problem may now be visualized geometrically. Figure 2 shows a hypothetical situation for p = 3 and k = 2. Natural constraint 1 simply means that the two end-members a_1 and a_2 lie inside what is denoted the positive orthant, i.e., the part of space constrained by the positive coordinate axes. For p = 3, this is an open pyramid. The end-members in Figure 2, in turn, define a plane, part of which is the triangle C between the end-members. A third end-member would be linearly dependent on the two others if it lay in this plane and linearly independent if it lay outside it. The mixture x lies in the plane because it is linearly dependent on the end-members. By natural constraint 2, it is further limited to the triangle C. In higher dimensions, where the geometry is obviously more complicated, one works with "hyperplanes", but we will use the term "plane" in all cases.

Regarding other conservative mixing models, the differences lie in how the end-member and mixture variables are scaled and whether one wants relative or absolute contributions. In air pollution, as we have seen, the sum of the variables is ≤ 1 for each end-member while the contributions are only constrained to be non-negative. In a carrier-oriented hydrochemical problem, on the other hand, the situation is in a sense reversed; the end-member variables only have to be non-negative, while the contributions are required to sum to 1 (29).
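The two scalings can be written side by side. The following is only a restatement of the constraints above in the notation of eq 1, where a_{ij} (a symbol introduced here for illustration) denotes variable i of end-member j:

```latex
\begin{aligned}
\text{content-oriented (air pollution):}\quad
  & a_{ij} \ge 0, \qquad \sum_{i=1}^{p} a_{ij} \le 1, \qquad f_j \ge 0, \\
\text{carrier-oriented (hydrochemistry):}\quad
  & a_{ij} \ge 0, \qquad f_j \ge 0, \qquad \sum_{j=1}^{k} f_j = 1 .
\end{aligned}
```

In both cases the mixture is still x = Af; only the normalization moves from the end-members to the contributions.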

3. Direct Analysis

Here, the end-members form the starting point, and the problem is to estimate their contributions to the observed mixtures in the presence of experimental and other uncertainty. The basic requirement is that the end-members are known well enough to be specified quantitatively. Note that a direct analysis may also be carried out as part of a hypothesis test. Different sets of end-members may be tried, and one may test how well each combination explains the mixtures.

For given end-members, random errors would typically cause the mixture x in Figure 2 to lie outside the plane so that it is no longer a linear combination of the end-members. This situation is shown in Figure 3, which is redrawn from Figure 2. Geometrically, the least-squares approach computes a corrected or estimated mixture, namely the point in C closest to x. The estimated contributions are obtained by expressing this corrected mixture as a linear combination of the end-members. The standard least-squares procedure computes x*, which is the point in the end-member plane closest to x (Figure 3a). If x* satisfies the second natural constraint and therefore lies in C, it is the desired solution. If not, a constrained least-squares optimization (constraining the solution to lie in C) is performed to obtain the point x_est in Figure 3b.

It is useful to consider the properties of x* in more detail. Since it lies in the end-member plane, we have x* = Af* for some coefficients f*, which are the estimated contributions when x* is in C. (Then all elements of f* are non-negative.) To obtain f* (and x*), the length of the error vector e = x - x*, being $(\sum_{j=1}^{p} e_j^2)^{1/2}$, is minimized. The result is the solution of the so-called normal equations, $f^* = (A^T A)^{-1} A^T x$. Geometrically, x* is the orthogonal projection of the observed mixture onto the end-member plane (Figure 3a). Note that a multidimensional version of the Pythagorean theorem holds between the squared lengths of the vectors x, x*, and e:

$$\sum_{j=1}^{p} x_j^2 = \sum_{j=1}^{p} (x_j^*)^2 + \sum_{j=1}^{p} e_j^2$$

The ratio

$$0 \le \frac{\sum_{j=1}^{p} (x_j^*)^2}{\sum_{j=1}^{p} x_j^2} \le 1$$

may be taken as a measure of how well the given end-members explain the mixture x (provided x* is in C). Since the error length (and its square) is minimized, no other point in C can give a larger ratio. If the lack of fit over the observed mixtures does not seem consistent with what is known about errors and uncertainties, there is reason to check the end-members. Also, different end-member sets may be assessed by comparing their overall fit.

For the case where x is accurately observed close to the end-member plane but x* lies outside C, Renner (12) introduced an alternative approach where the end-members are corrected. The end-members nearest to x* are modified by moving them "outwards" until the resulting set of corrected end-members spans the mixture projection in the appropriate way (Figure 3c). Related techniques have been introduced by Miesch (10) and Full et al. (7, 8).

Given fixed end-members, several points are of practical concern for the direct analysis; they relate to the numerical conditioning of the end-members, the error statistics of the mixtures, and errors in the end-members themselves.
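The projection, the Pythagorean relation, the fit ratio, and the constrained fit of Figure 3b can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the end-members and mixtures are hypothetical numbers for p = 3 and k = 2, and the constrained step uses a brute-force search over subsets of end-members, which is only practical for small k (the function names `project_onto_plane` and `nonnegative_fit` are introduced here for illustration).

```python
import itertools
import numpy as np

def project_onto_plane(A, x):
    """Unconstrained least squares: returns f* and the projection x* = A f*."""
    f_star, *_ = np.linalg.lstsq(A, x, rcond=None)
    return f_star, A @ f_star

def nonnegative_fit(A, x):
    """Brute-force non-negative least squares for small k.

    Every non-empty subset of end-members is fitted without constraints;
    the feasible fit (all coefficients >= 0) with the smallest error is kept.
    """
    p, k = A.shape
    best_f, best_err = np.zeros(k), np.linalg.norm(x)  # f = 0 is always feasible
    for size in range(1, k + 1):
        for cols in itertools.combinations(range(k), size):
            idx = list(cols)
            f_sub, *_ = np.linalg.lstsq(A[:, idx], x, rcond=None)
            if np.all(f_sub >= 0):
                err = np.linalg.norm(x - A[:, idx] @ f_sub)
                if err < best_err:
                    f = np.zeros(k)
                    f[idx] = f_sub
                    best_f, best_err = f, err
    return best_f, A @ best_f

# Hypothetical end-members (columns) for p = 3 variables and k = 2.
A = np.array([[0.8, 0.1], [0.1, 0.7], [0.1, 0.2]])

# Case of Figure 3a: a mixture with small errors, projection inside C.
x_a = A @ np.array([2.0, 1.0]) + np.array([0.05, -0.03, 0.04])
f_a, xstar_a = project_onto_plane(A, x_a)
f_normal = np.linalg.solve(A.T @ A, A.T @ x_a)  # normal equations, same answer
e_a = x_a - xstar_a
ratio_a = (xstar_a @ xstar_a) / (x_a @ x_a)      # fraction of squared length explained

# Case of Figure 3b: the projection needs a negative contribution, so it lies outside C.
x_b = np.array([1.6, 0.1, 0.1])
f_b, xstar_b = project_onto_plane(A, x_b)
f_est, x_est = nonnegative_fit(A, x_b)

print("f* (case a):", f_a, "ratio:", ratio_a)
print("f* (case b):", f_b, "constrained estimate:", f_est)
```

Solving with `np.linalg.lstsq` rather than forming (A^T A)^{-1} explicitly is a common numerical precaution when the end-members are nearly dependent; for larger k a dedicated non-negative least-squares routine would replace the subset search.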


FIGURE 3. In panel a, the orthogonal projection x* of the mixture x onto the end-member plane lies within the triangle C spanned by the end-members a_1 and a_2. In panel b, x* is outside C, and the least-squares estimate is moved onto C, resulting in the point x_est, which is the point in C closest to x. In special situations, the end-members may be modified as in panel c where a_1 is replaced by a'_1.

(1) End-members may be almost linearly dependent, resulting in normal equations that are numerically unstable, i.e., the required matrix inversion is ill-conditioned and small changes in x result in large changes in f*.

(2) Uncertainties in the estimated contributions arise because of both observational errors in the mixtures and errors in the specification of the end-members. If the former problem dominates, f* is an unbiased, minimum variance estimate of f, provided the errors in the mixture variables are uncorrelated and have the same variance. If this is not the case, and the second-order error statistics (variances and covariances) are known or can be inferred, a weighted least-squares solution should be used, effectively scaling the data so that decorrelated errors with unit variances are obtained (e.g., refs 18 and 31).

(3) If both mixtures and end-members are significantly error-corrupted, techniques presented in refs 18, 20, and 27 may be used. Note, however, that a general least-squares procedure (total least-squares) has been developed for this situation (30). This technique has led to improved results for mathematically similar problems within statistical signal processing (32, 33).

Clearly, many factors influence the results of a direct analysis, and the critical aspects may vary from case to case. In practice, computer simulations, using synthetic data with known answers, are recommended as a simple way to "tailor" the uncertainty analysis to a specific situation. If error estimates exist (and in science they should), synthetic error-corrupted mixtures can be derived from assumed representative end-members. One can then study how the uncertainties propagate through the calculations by comparing f* and x* (and if necessary x_est) with their true values. This can be done in many ways, one of which is shown in Figure 4. Assuming correct error statistics, this distribution reflects the uncertainty in the estimated contribution from the first end-member in Figure 1a to the mixture in Figure 1b. This is an example of a Monte Carlo simulation; the same basic calculation is performed many times (here 1000) with random errors drawn from the assumed distributions each time.

Using synthetic data, one may test how well the mixing problem is posed. If the results for synthetic data are not considered good enough, there is little point in trying real data with similar properties. Before doing so, the problem must be reanalyzed taking the above discussion as well as problem-specific conditions into consideration.

FIGURE 4. Histogram based on 1000 simulations showing the distribution of the error in the contribution from the first end-member in Figure 1a to the mixture in Figure 1b. The variables of both the end-members and the mixture were corrupted by normally distributed errors with zero mean and 5% standard deviation. The true underlying contribution is 3 µg/m³.
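A Monte Carlo experiment of the kind behind Figure 4 might look as follows. This is a sketch under stated assumptions: the end-member profiles are hypothetical (not the Henry and Kim profiles), the "5% standard deviation" is interpreted as a multiplicative error relative to each true value, and the unconstrained least-squares estimate is used in every replicate.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the experiment is repeatable

# Hypothetical, nearly orthogonal end-members (p = 6 variables, k = 3).
# Illustrative mass fractions only.
A_true = np.array([
    [0.30, 0.01, 0.02],
    [0.20, 0.02, 0.01],
    [0.01, 0.25, 0.02],
    [0.02, 0.15, 0.01],
    [0.01, 0.02, 0.20],
    [0.02, 0.01, 0.10],
])
f_true = np.array([3.0, 3.0, 3.0])  # assumed true contributions
x_true = A_true @ f_true

n_sim, rel_sd = 1000, 0.05
errors = np.empty(n_sim)
for s in range(n_sim):
    # Corrupt both the end-members and the mixture with zero-mean normal
    # errors whose standard deviation is 5% of each true value (assumption).
    A_obs = A_true * (1.0 + rel_sd * rng.standard_normal(A_true.shape))
    x_obs = x_true * (1.0 + rel_sd * rng.standard_normal(x_true.shape))
    f_star, *_ = np.linalg.lstsq(A_obs, x_obs, rcond=None)
    errors[s] = f_star[0] - f_true[0]  # error in the first contribution

# Histogram counts corresponding to a plot like Figure 4.
counts, edges = np.histogram(errors, bins=20)
print(f"mean error {errors.mean():+.3f}, standard deviation {errors.std(ddof=1):.3f}")
```

The spread of `errors` is the quantity of interest; rerunning the loop with other error levels, other end-member sets, or a constrained estimator shows how sensitive a given problem is before real data are analyzed.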

4. Inverse Analysis

In an inverse situation, one would typically know something about potential end-members but not enough to specify the number and quantify the compositions. For example, the end-members could be known to be basically orthogonal (as in Figure 1), but one would be unable to give the number of sources and the relative amounts of the variables. The mixtures then form the starting point of a two-step procedure where the end-member plane is first estimated using principal component analysis (PCA; see, for example, refs 5 and 9). With error-corrupted observations, this is usually a non-trivial task. Corrected mixtures are obtained by projecting the observations onto the plane. In the second step, the end-members are sought in the plane. As will be seen, the natural constraints only provide a feasible end-member region, and the more qualitative and problem-specific information is used to further limit the possibilities.

As in the direct case, one may carry out an inverse analysis as part of a hypothesis test. The question is then whether or not the observations can reasonably be explained as mixtures of plausible end-members. In hydrochemistry, for instance, this would be a natural approach as many other processes in addition to mixing may potentially affect the soil water on its way to the stream. A critical attitude is necessary when working in this mode because plausible end-members may result from the analysis, even if underlying processes other than mixing give rise to the observations. The results should therefore be thoroughly assessed. A testing procedure adapted to hydrochemistry is suggested in ref 25.

4.1. End-Member Plane. The situation may be visualized for p = 3 by considering Figure 2 or Figure 3a and imagining a cluster of mixtures, x_i (i = 1, ..., n), more or less close to the plane defined by a_1 and a_2. Given only the mixtures, one would not know a priori whether there is indeed a plane (two end-members) or a line (one end-member forming a "one-dimensional plane") that is correct. To find the right dimension as well as the orientation of the plane/line, it is clear that the number of points, their distribution in space, and the associated errors all play a role.

PCA may be explained by starting from the assumption of a single end-member, seeking the one that is best in a least-squares sense, considering all mixtures simultaneously. This is the first principal component (PC) vector or axis u_1. [Grouping all mixtures as columns into a p × n matrix X (p < n), the first principal axis is formally the eigenvector of XX^T corresponding to the largest eigenvalue.] Geometrically, it is a line through the origin passing through the cluster of mixtures as indicated in Figure 5a. As such, it resembles but is not identical to a regression line through the origin. All mixtures are projected onto the line, and in the optimum position as determined by PCA the sum of the squared distances from all mixtures to the line is minimized.
The projection x*, indicated by an asterisk (*) in Figure 5a, is then a first approximation to the mixture x. For the projection of mixture x_i we then have x*_i = v_{1i} u_1 for some scores v_{1i} (i = 1, ..., n). Defining the error e_i = x_i - x*_i, PCA leads to a generalized Pythagorean theorem:

$$\sum_{j=1}^{n}\Big(\sum_{i=1}^{p} x_{ij}^2\Big) = \sum_{j=1}^{n}\Big(\sum_{i=1}^{p} (x_{ij}^*)^2\Big) + \sum_{j=1}^{n}\Big(\sum_{i=1}^{p} e_{ij}^2\Big)$$

where x_{ij} denotes element (i, j) of X, i.e., variable i of mixture j, and the last term (the sum of squared error lengths) is minimized. Here, $\sum_{j=1}^{n}(\sum_{i=1}^{p} x_{ij}^2)$ is the sum of squared lengths of all mixtures, which is denoted the total variation in the observations. The term $\sum_{j=1}^{n}(\sum_{i=1}^{p} (x_{ij}^*)^2)$, which is maximized, is the variation explained by the first principal axis. Usually, one considers the normalized ratio

$$0 \le \frac{\sum_{j=1}^{n}\big(\sum_{i=1}^{p} (x_{ij}^*)^2\big)}{\sum_{j=1}^{n}\big(\sum_{i=1}^{p} x_{ij}^2\big)} \le 1$$

as a measure of the total variation explained.

The second PC, u_2, is determined by considering all errors e_i and finding the best approximations, v_{2i} u_2, to these. (Since all errors are orthogonal to u_1, u_2 and u_1 are also orthogonal.) The best two-dimensional plane passing through the data cluster is formed by u_1 and u_2 and contains the improved mixture approximations x*_i = v_{1i} u_1 + v_{2i} u_2, each of which is now the orthogonal projection of x_i onto the plane, explaining a larger part of the variation in the data. The situation is illustrated in Figure 5b. By considering the errors in this approximation, a third PC may be derived.

FIGURE 5. Two steps of a PCA. In panel a, the first principal axis u_1 is indicated together with the mixture cluster (dots). The orthogonal projection of the mixture x onto u_1 (star) is also shown. The mixtures should be imagined in three dimensions around u_1. In panel b, the two first orthogonal principal axes u_1 and u_2 define a plane onto which the mixtures are projected (stars). The triangle between the two broken lines defines the border of the feasible end-member region, C, by natural constraint 1. Two potential end-members, a*_1 and a*_2, also satisfying the second natural constraint, are shown.

A total of p PCs may be derived, defining a family of planes of dimension from 1 to p, approximating the data cluster more and more accurately. The p-dimensional plane will explain all the variation since the mixtures x_i are themselves p-dimensional. However, with largely error-free mixture data generated from k end-members, the first k PCs suffice to explain essentially all the variation. This is the basic observation underlying the first step of the inverse analysis. By considering the variation explained as a function of the number of PCs, the correct dimension of the end-member plane may be inferred.

In practice, care must be taken if the mixture variables span different numerical ranges. This may, for example, be the case if different units are used. Considering the total variation in the data, the smaller scale variations may be at the noise level of the larger scales and therefore tend to be neglected. To condition the problem better in this sense, the standard procedure applied in mixing analysis is to give each variable equal weight by centering (the origin is moved to the mean of all variables) followed by a scaling with the standard deviation. Figure 6a shows the result of doing this for the error-free synthetic air pollution data. One notes that three PCs correctly explain all the variation.
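The first step of the inverse analysis can be sketched with a singular value decomposition, whose left singular vectors are the eigenvectors of XX^T. The example below is an illustration under stated assumptions: the end-members and contributions are randomly generated hypothetical data, the analysis is carried out through the origin as in the formal definition above (without the centering and scaling used for Figure 6), and the error level is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n = 50 error-free mixtures of k = 3 end-members, p = 6 variables.
A = rng.uniform(0.0, 1.0, size=(6, 3))   # non-negative end-members as columns
F = rng.uniform(0.0, 5.0, size=(3, 50))  # non-negative contributions
X = A @ F                                 # p x n matrix with mixtures as columns

# Left singular vectors of X are the principal axes (eigenvectors of X X^T);
# the squared singular values are the corresponding eigenvalues.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = np.cumsum(s**2) / np.sum(s**2)  # cumulative fraction of total variation

# Orthogonal projection of all mixtures onto the plane of the first k axes.
k = 3
X_star = U[:, :k] @ (U[:, :k].T @ X)
E = X - X_star

# With errors added, variation leaks into more than k directions.
X_noisy = X + 0.25 * X.std() * rng.standard_normal(X.shape)
s_noisy = np.linalg.svd(X_noisy, compute_uv=False)
explained_noisy = np.cumsum(s_noisy**2) / np.sum(s_noisy**2)

print("error-free, cumulative variation explained:", np.round(explained, 4))
print("with errors, cumulative variation explained:", np.round(explained_noisy, 4))
```

For the error-free data the cumulative ratio reaches 1 at three PCs; with errors it approaches 1 only gradually, which is the ambiguity the analyst must resolve when choosing the dimension.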


FIGURE 6. Percentage of the variation explained in the synthetic air pollution data of Henry and Kim (k = 3) as a function of the number of PCs retained. The variables have been centered and scaled by their standard deviations. No errors in panel a; in panels b and c, normally distributed errors were added with standard deviations equal to 25% and 50%, respectively.

When errors in the observed variables play a role, the mixtures show non-negligible variations in more than k directions (cf. Figure 6b,c). The user then has to strike a balance. If, on the one hand, a plane of too high dimension is selected, some of the variation explained is simply noise, and false end-members will subsequently be introduced. On the other hand, if the dimension is too low because part of the true variation is falsely considered as noise, some end-members will not be detected. Many rules have been suggested to aid in deciding the correct dimension (5, 9, 20). In the end, however, a subjective decision must be made. With the noise level in Figure 6b, an experienced analyst may well decide for three PCs, giving the correct number of end-members. However, the estimated plane may well be "tilted" to some degree with respect to the true plane. In Figure 6c, the noise level is too high to allow a correct determination of the number of end-members; this may, of course, also be the situation in a real case.

A pretreatment to PCA is possible if the second-order error statistics are known. Then the errors in each variable can be decorrelated and the variances standardized as in the direct case. This technique, applied within statistical signal processing, facilitates the dimensional problem and leads to a PCA plane whose direction is statistically unbiased (32, 33). The point is simply that uncorrelated measurement errors form a "ball" without any preferred direction. The structure picked up by the PCA is therefore due to the mixing. Correlated errors, on the other hand, superimpose a structure on the data that may confound the analysis.

4.2. Estimating the End-Member Profiles. 4.2.1. Feasible End-Member Region. Assume now that a PCA plane, considered to explain a sufficient part of the variation in the data, has been determined in the first step. The mixtures x_i are now replaced by their orthogonal projections onto this plane, forming corrected mixtures on which the rest of the analysis is based, cf. Figure 5b. Unless k = 1, the principal axes cannot be used as end-member candidates. The basic problem is illustrated for k = 2 and p = 3 in Figure 5b. End-member candidates (such as a*_1 and a*_2) must only have non-negative variables (natural constraint 1), and they must span the mixture approximations in the appropriate way (natural constraint 2).
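Both natural constraints can be checked mechanically for any set of candidates lying in the estimated plane. The following sketch uses hypothetical end-members for p = 3 and k = 2 and a helper named `is_feasible` (introduced here for illustration); it is one way to express the two conditions, not a method taken from the references.

```python
import numpy as np

def is_feasible(candidates, X_star, tol=1e-9):
    """Check both natural constraints for candidate end-members (columns).

    candidates: p x k matrix of candidate profiles in the estimated plane.
    X_star:     p x n matrix of corrected mixtures in the same plane.
    """
    # Natural constraint 1: no candidate may have a negative variable.
    nonneg_variables = bool(np.all(candidates >= -tol))
    # Natural constraint 2: every corrected mixture must be reproduced
    # with non-negative contributions from the candidates.
    F, *_ = np.linalg.lstsq(candidates, X_star, rcond=None)
    nonneg_contributions = bool(np.all(F >= -tol))
    return nonneg_variables and nonneg_contributions

rng = np.random.default_rng(2)
A_true = np.array([[0.6, 0.1], [0.3, 0.2], [0.1, 0.7]])  # hypothetical end-members
F_true = rng.uniform(0.5, 2.0, size=(2, 40))              # positive contributions
X_star = A_true @ F_true                                   # error-free, already in the plane

# Principal axes of the (uncentered) mixtures.
U, s, Vt = np.linalg.svd(X_star, full_matrices=False)
U2 = U[:, :2]

print("true end-members feasible:", is_feasible(A_true, X_star))
print("principal axes feasible:  ", is_feasible(U2, X_star))
```

The principal axes fail because the second axis is orthogonal to a first axis whose elements all share one sign, so it must contain elements of both signs and violates natural constraint 1.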
In Figure 5b, u_2 lies outside the open triangle between the broken lines, indicating the feasible end-member region by the first natural constraint. The picture in Figure 5b is valid qualitatively for arbitrary p as long as k = 2. This was shown by Lawton and Sylvestre (3), who denoted their technique "self-modeling curve resolution". (They worked within spectrophotometry.) The trick is to consider the k scores and not the p variables and to ask for all combinations of scores producing end-members satisfying both natural constraints, given the two fixed p-dimensional principal axes from step one. Geometrically, it turns out that the set of valid scores, considered as points in the plane, is constrained to lie between an inner and an outer region, as for the end-members in Figure 5b.

In general, given k end-members, the feasible region can be considered in the k-dimensional score plane and not the p-dimensional variable plane. Several investigators have characterized the feasible region for k = 3 (1, 2, 19). A recent survey, including computational algorithms, may be found in ref 6. However, for k ≥ 4, this has turned out to be a difficult problem that to our knowledge is not yet solved in general.

We will briefly indicate the situation for k = 3 using the synthetic air pollution data in Figure 1. In this case, the end-members and the mixtures lie in a three-dimensional plane embedded within a 10-dimensional variable plane. The situation in the three-dimensional score plane is illustrated in Figure 7. The figure shows two pyramids with a common top opening toward the reader. The outer pyramid represents the first natural constraint, and any set of valid end-member scores has to lie on or inside this pyramid. To comply with the second natural constraint, the pyramid formed by any set of three feasible end-members must circumscribe the mixture scores. In Figure 7, the inner pyramid is formed by the three true end-members, and the mixture scores are represented by the dots.

FIGURE 7. Two pyramids with a common apex. The outer pyramid forms the feasible score region for the end-members in the synthetic air pollution data set of Henry and Kim. The inner pyramid is formed by the three true end-member scores and the mixture scores (dots) lie within this pyramid.

Returning to the general inverse situation, two points may be noted. First, assume that the observations are indeed mixtures of some underlying end-members and that the plane they define has been accurately estimated by the PCA. The natural constraints are then in general not sufficient to define any unique solution. For example, knowing only the mixtures and the outer pyramid in Figure 7, it is clear that many combinations of three potential end-members would satisfy the natural constraints. Second, if processes other than conservative mixing caused the structure in the data leading to the PCA plane, the data could still be explained as mixtures with a multitude of possible end-members. To limit the options and to allow an assessment of the remaining possibilities, the auxiliary or qualitative information is brought into play. Three such approaches will be described.

4.2.2. Assessing Separate End-Member Candidates. In this approach, one has available potential end-members not included in the PCA stage. This will be the case, for example, if end-member profiles or spectra are taken from a library. One might also wish to exclude certain data from the PCA to test them separately as end-members. Two rather similar procedures have been developed independently to treat such situations; one is the end-member mixing analysis (EMMA) proposed by Christophersen and Hooper (25) for the carrier-oriented case, while the other is the target transformation factor analysis (TTFA) described by Hopke (20) for the content-oriented situation. In both cases, one projects the candidates onto the estimated end-member plane. These projections form a corrected set of candidates, consistent with the estimated plane.
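The projection of separate candidates onto the estimated plane, together with their distance from it, can be sketched as follows. The plane, the candidates, and the helper name `project_candidates` are hypothetical and chosen for illustration; the sketch shows only the shared geometric step and is not a reproduction of the EMMA or TTFA procedures.

```python
import numpy as np

def project_candidates(U_k, candidates):
    """Project candidate profiles (columns) onto the plane spanned by U_k.

    U_k must have orthonormal columns, e.g., the first k principal axes.
    Returns the projections and the distance of each candidate from the plane.
    """
    scores = U_k.T @ candidates       # coordinates of the candidates in the plane
    projections = U_k @ scores        # corrected candidates, consistent with the plane
    distances = np.linalg.norm(candidates - projections, axis=0)
    return projections, distances

# Hypothetical plane in p = 3 dimensions spanned by two end-members.
A_true = np.array([[0.6, 0.1], [0.3, 0.2], [0.1, 0.7]])
U_k, _ = np.linalg.qr(A_true)         # orthonormal basis for the same plane

candidates = np.column_stack([
    A_true[:, 0],                     # a candidate lying in the plane
    np.array([0.2, 0.6, 0.2]),        # a hypothetical library profile off the plane
])
projections, distances = project_candidates(U_k, candidates)
relative = distances / np.linalg.norm(candidates, axis=0)

print("distance from plane:", np.round(distances, 4))
print("relative to candidate length:", np.round(relative, 4))
```

A candidate with a large relative distance would be questionable from the outset; the projected candidates can then be passed to a check of the two natural constraints.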
As a first screening test, one should check the distance of the original candidates from the plane. Candidates that are not close to the plane are questionable from the outset (25). The next step is to consider the candidate projections (or their scores) to see which combinations, if any, satisfy the natural constraints. If no feasible combination appears, the candidates can be discarded. If more than one combination is feasible, the investigator has to assess the options using problem-specific information.

4.2.3. Utilizing Extreme Mixtures from the PCA Stage. Here, the idea is that some of the mixtures included in the PCA stage are close to the real end-members. These mixtures do not have to be known in advance. Renner (12) presented an approach aimed at geological applications, where k extreme observations in the end-member plane are chosen initially. If they span all the other observations in the appropriate sense, they become the end-members. If not, they are moved outwards in the plane until all the other mixtures become interior points. The solution depends on the initial choice of extreme mixtures, and Renner also presented two slightly different algorithms that do not produce identical solutions. Similar methods have been developed by Miesch (10). In these methods, the principal axes may be rotated to coincide with specified mixtures. If these form a set of k “extreme” observations (10), the resulting end-members are feasible. In a related approach (7, 8), the scores are moved outwards until all the remaining mixtures are enclosed. This technique is very similar to the ones described by Renner.

4.2.4. Orthogonal and “Simple” End-Members. Assume that the end-members are basically orthogonal and therefore comprise only a limited number of dominant and mutually exclusive variables, as in Figure 1. Such end-members are said to have a “simple” structure. The Varimax approach, originally developed for analysis of psychological data (34), is the standard technique used to derive such end-members from the (orthogonal) principal axes by “rigid” rotations. This technique is frequently applied to inverse problems in air pollution and geology (7-10, 21-23). An end-member lies on the outer boundary defined by the first natural constraint if and only if it has at least one variable (loading) equal to zero, cf. Figure 7. Orthogonal end-members satisfy this condition. Varimax then tries to rotate the principal axes so that they are placed at the outer boundary. One may envision future generalized techniques that directly locate feasible solutions at this boundary. Such methods would not necessarily be based on strict orthogonality and could offer the data analyst a wider range of criteria from which to derive simple solutions. However, given the difficulties in characterizing the feasible region in dimension 4 and higher, such an approach needs more mathematical study.
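To make the idea of a rigid rotation towards “simple” structure concrete, the following is a minimal sketch for the special case of two retained principal axes. It is not Kaiser's iterative procedure (34): it simply evaluates the raw (unnormalized) varimax criterion on a grid of rotation angles and keeps the best one. The function names, the grid resolution, and the small loading matrix are illustrative assumptions and do not come from the article.

```python
import numpy as np

def varimax_criterion(loadings):
    """Raw varimax criterion: variance of the squared loadings, summed over axes."""
    return float(np.sum(np.var(loadings ** 2, axis=0)))

def rotation(angle):
    """2 x 2 rigid rotation matrix for an angle given in radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def varimax_two_axes(loadings, n_angles=90):
    """Brute-force search over rigid rotations of two orthogonal axes.

    Angles in [0, 90) degrees are sufficient: a further quarter turn only
    swaps the two axes and flips a sign, which leaves the criterion unchanged.
    """
    angles = np.deg2rad(np.linspace(0.0, 90.0, n_angles, endpoint=False))
    scores = [varimax_criterion(loadings @ rotation(a)) for a in angles]
    best = angles[int(np.argmax(scores))]
    return loadings @ rotation(best), best

# Hypothetical loadings with a "simple" structure: each variable (row)
# belongs to exactly one axis (column).
simple = np.array([[0.9, 0.0],
                   [0.8, 0.0],
                   [0.7, 0.0],
                   [0.0, 0.9],
                   [0.0, 0.6]])

# Hide the structure by a rigid rotation of 30 degrees, then try to recover it.
mixed = simple @ rotation(np.deg2rad(30.0))
rotated, best_angle = varimax_two_axes(mixed)

print(np.round(rotated, 3))
print(np.rad2deg(best_angle))
```

In this constructed example every variable ends up with one loading equal to zero (up to rounding), and the signs and the order of the axes are arbitrary. With more than two axes or with noisy loadings, an iterative implementation would be needed and an exact zero would not be expected.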

VOL. 30, NO. 7, 1996 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

5. Discussion

The various fields employing mixing models have often developed their own techniques starting from different assumptions. However, within a geometrical framework, a more unified approach becomes possible, allowing the relationships between the various mixing models and the associated data analysis techniques to be recognized.

Both direct and inverse mixing analyses should be carried out with caution, as several issues must be addressed to secure meaningful results. This is particularly important for inverse problems, where the subjective elements of the analysis may be considerable. Seemingly plausible end-members may be found even if the observations do not result from mixing. Henry (16) discusses some of these issues in connection with air pollution apportionment.

As a general guideline to inverse problems, it is important to have as much information about the end-members as possible. Methods based on extreme mixtures or end-member candidates explicitly assume potential source estimates to be available. If one has control over the measurement program, the experimental design should, if possible, be used to obtain observations close to the end-members.

At the practical level, we have emphasized the role of computer experiments using synthetic data. Analysis of real mixtures, like multivariate data analysis in general, involves decisions including weighting, scaling, centering, outlier removal, and the number of PCs to retain. With only insignificant errors, these issues may not be of great concern, but with more realistic and error-prone data they may be critical. Given some estimate of the input errors (which should be available in scientific problems), synthetic error-corrupted mixture data can be derived from plausible sets of end-members. It is then in principle a simple, albeit computer-intensive, exercise to investigate how the input errors propagate through the calculations by performing many simulations. Such investigations may also be used to test different experimental designs. Synthetic data are not new in the mixing context (9, 16); our point is that these techniques should be used systematically and routinely. Such methods provide, in a sense, necessary tests. Thus, if one is not satisfied with results for synthetic data where the answers are known, one should not proceed to real data with similar properties.
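A computer experiment of the kind recommended above might be sketched as follows. This is only an illustration under stated assumptions: the end-members are taken as known (a direct problem), the input errors are independent Gaussian with a common standard deviation, and the mixing proportions are estimated by unconstrained least squares, without enforcing non-negativity or a sum of one. The function name, the end-member compositions, and the error level are hypothetical.

```python
import numpy as np

def propagate_errors(end_members, proportions, noise_sd, n_sim=1000, seed=0):
    """Monte Carlo spread of least-squares estimates of mixing proportions.

    end_members: (k, m) array, one end-member composition per row.
    proportions: (n, k) array of true mixing proportions, one mixture per row.
    Returns the mean and standard deviation of the estimates, each (n, k).
    """
    rng = np.random.default_rng(seed)
    clean = proportions @ end_members            # error-free mixtures, (n, m)
    estimates = np.empty((n_sim,) + proportions.shape)
    for i in range(n_sim):
        # Assumed error model: additive, independent Gaussian noise.
        noisy = clean + rng.normal(0.0, noise_sd, size=clean.shape)
        # Unconstrained least squares for all mixtures at once.
        solution, *_ = np.linalg.lstsq(end_members.T, noisy.T, rcond=None)
        estimates[i] = solution.T
    return estimates.mean(axis=0), estimates.std(axis=0)

# Hypothetical example: three end-members described by four variables.
end_members = np.array([[10.0, 1.0, 2.0, 4.0],
                        [2.0, 8.0, 1.0, 3.0],
                        [1.0, 2.0, 9.0, 5.0]])
true_proportions = np.array([[0.6, 0.3, 0.1],
                             [0.2, 0.5, 0.3],
                             [0.1, 0.1, 0.8]])

mean_est, sd_est = propagate_errors(end_members, true_proportions, noise_sd=0.2)
print(np.round(mean_est, 3))
print(np.round(sd_est, 3))
```

The same loop can be rerun with other error levels, other sets of plausible end-members, or other sampling designs to compare how strongly the estimates scatter around the known answers. An inverse analysis (PCA followed by end-member selection) could be substituted for the least-squares step in the same way.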

Acknowledgments

Support for M.-A.A. from the Norwegian Research Council is gratefully acknowledged.

Literature Cited

(1) Borgen, O. S.; Kowalski, B. R. Anal. Chim. Acta 1985, 174, 1-26.
(2) Borgen, O. S.; Davidsen, N.; Mingyang, Z.; Øyen, Ø. Mikrochim. Acta 1986, 2, 63-73.
(3) Lawton, W. H.; Sylvestre, E. A. Technometrics 1971, 13, 617-632.
(4) Malinowski, E. R.; Howery, D. G. Factor analysis in chemistry; Wiley: New York, 1980.
(5) Martens, H.; Næs, T. Multivariate calibration; Wiley: New York, 1989.
(6) Ukkelberg, Å. Approaches to the component resolution problem. Ph.D. Thesis 46, The Norwegian Institute of Technology, Trondheim, Norway, 1994, 260 pp.
(7) Full, W. E.; Ehrlich, R.; Klovan, J. E. Math. Geol. 1981, 13, 331-344.
(8) Full, W. E.; Ehrlich, R.; Bezdek, J. C. Math. Geol. 1982, 14, 259-270.


(9) Jöreskog, K. G.; Klovan, J. E.; Reyment, R. A. Geological factor analysis; Elsevier: Amsterdam, 1976.
(10) Miesch, A. T. Comput. Geosci. 1976, 1, 147-159.
(11) Miesch, A. T. Math. Geol. 1980, 12, 523-538.
(12) Renner, R. M. On the resolution of compositional datasets into convex combinations of extreme vectors; Technical Report 88/02; Institute of Statistics and Operations Research, Victoria University: Wellington, New Zealand, 1988; 48 pp.
(13) Renner, R. M.; Jurke, S. R. Math. Geol. 1992, 24, 287-303.
(14) Renner, R. M. Appl. Stat. 1993, 42, 615-631.
(15) Gordon, G. E. Environ. Sci. Technol. 1988, 22, 1132-1142.
(16) Henry, R. C. Atmos. Environ. 1987, 21, 1815-1820.
(17) Henry, R. C. Atmos. Environ. 1992, 26A, 933-938.
(18) Henry, R. C.; Lewis, C. W.; Hopke, P. K.; Williamson, H. J. Atmos. Environ. 1984, 18, 1499-1506.
(19) Henry, R. C.; Kim, B. M. Chemom. Intell. Lab. Syst. 1990, 8, 205-216.
(20) Hopke, P. K. Receptor modeling in environmental chemistry; Wiley: New York, 1985.
(21) Li, S.-M.; Winchester, J. W. Atmos. Environ. 1989, 23, 2387-2399.
(22) Thurston, G. D.; Spengler, J. D. Atmos. Environ. 1985, 19, 9-26.
(23) Van Borm, W. A.; Adams, F. C.; Maenhaut, W. Atmos. Environ. 1990, 24B, 419-435.
(24) Wang, D.; Hopke, P. K. Atmos. Environ. 1989, 23, 2143-2150.
(25) Christophersen, N.; Hooper, R. P. Water Resour. Res. 1992, 28, 99-107.
(26) Christophersen, N.; Neal, C.; Hooper, R. P.; Vogt, R. D.; Andersen, S. J. Hydrol. 1990, 116, 307-320.
(27) Hooper, R. P.; Christophersen, N.; Peters, N. E. J. Hydrol. 1990, 116, 321-343.
(28) Akerjord, M.-A. A geometrical analysis of mixing models (In Norwegian). Master Thesis, University of Oslo, Oslo, Norway, 1993, 111 pp.
(29) Akerjord, M.-A.; Christophersen, N. The geometry of end-member mixing and unmixing: Well-posed and ill-posed problems; Research Report; Department of Informatics, University of Oslo: Oslo, Norway, 1993.
(30) Björck, Å. Least-squares methods. In Handbook of numerical analysis, vol. 1; Ciarlet, P. G., Lions, J. L., Eds.; Elsevier: Amsterdam, 1990; 199 pp.
(31) Draper, N. R.; Smith, H. Applied regression analysis; Wiley: New York, 1981; 709 pp.
(32) Rao, B. D.; Arun, K. S. Proc. IEEE 1992, 80, 283-308.
(33) Scharf, L. L. Statistical signal processing; Addison Wesley: Reading, MA, 1990; 524 pp.
(34) Kaiser, H. F. Psychometrika 1958, 23, 187-200.

Received for review October 31, 1994. Revised manuscript received January 16, 1996. Accepted March 27, 1996. ES940672N

Abstract published in Advance ACS Abstracts, May 1, 1996.