Environ. Sci. Technol. 1994, 28, 1037-1047
Heterogeneity Analysis for Binding Data Using an Adapted Smoothing Spline Technique

Maarten M. Nederlof,*,†,‡ Willem H. van Riemsdijk,† and Luuk K. Koopal*,§
Wageningen Agricultural University, Department of Soil Science and Plant Nutrition, P.O. Box 8005, 6700 EC Wageningen, The Netherlands, and Department of Physical and Colloid Chemistry, P.O. Box 8038, 6700 EK Wageningen, The Netherlands

* Corresponding authors' e-mail: [email protected].
† Department of Soil Science and Plant Nutrition.
‡ Present address: KIWA N.V., P.O. Box 1072, 3430 BB Nieuwegein, The Netherlands.
§ Department of Physical and Colloid Chemistry.

Heterogeneity analysis is a helpful tool to select a proper model for the description of ion binding to polyfunctional ligands. Two approaches to heterogeneity analysis are discussed: the local isotherm approximation (LIA) method and the differential equilibrium function (DEF) method. For both methods, the approximate distribution function of a given ligand system is a series of derivatives of the experimentally obtained binding function. To obtain reliable derivatives, a smoothing spline routine is adapted to the present problem. The smoothing parameter of the spline is determined by a generalized cross-validation criterion in combination with physical constraints derived from the binding function. With the spline function thus obtained, the distribution is calculated. Error bars for the distribution function can be calculated using the variance in the spline function; they indicate whether peaks in the distribution are significant. The methodology is applied to a synthetic data set to illustrate its capabilities and limitations and to copper binding to humic materials (data set of Hansen et al.) to illustrate its use in practice. The quality of the calculated distribution function depends on the experimental error in the data, the number of data points, and the binding range. On the basis of the calculated distribution function, a binding model can be selected.
Introduction

The binding of chemical species to polyfunctional ligands is a much studied subject (1-5) because it affects, for instance, the mobility and bioavailability of species in the natural environment. Polyfunctional ligands possess different functional groups that have different affinities for the species that may bind to these ligands. The different functional groups constitute an "intrinsic heterogeneity". Apart from this intrinsic heterogeneity, other factors such as electrostatic interactions, competition with protons, and particle conformation (2, 5) may also affect the binding behavior at a given concentration in solution. Only in favorable cases is it possible to discriminate between the intrinsic heterogeneity and "secondary" effects. For proton adsorption to humic and fulvic acids, it was shown that the electrostatic effects can be accounted for separately (6-10). In many cases, however, the separation of interactions is not possible. Modeling of binding to heterogeneous ligands is often done a priori by assuming a type of binding function, followed by fitting the model parameters. Munson and Rodbard (11) have extended the Scatchard analysis (12) to multiple discrete site binding. Perdue (13) used
a Gaussian distribution to model metal ion binding to humic material. Since different models for the distribution function lead to equally good descriptions of a given set of binding data, this is not a very satisfactory approach. Analyzing the binding in terms of a distribution function of affinity constants, without a priori assumptions with respect to the type of distribution, is more satisfactory. Once the distribution is obtained, an appropriate binding model can be selected. Several methods, both purely numerical and semianalytical, have been developed to study the heterogeneity of polyfunctional ligands or adsorbents on the basis of binding or adsorption experiments. The main objective of these methods is the calculation of a distribution function of affinity constants. Good results for gas adsorption studies have been reported with singular value decomposition (SVD) (14) and regularization (15, 16). The Fourier transform technique (17, 18) has also been used to obtain the distribution function. In all cases, constraints have to be used in order to obtain results free of spurious oscillations. In this study, we focus on the semianalytical methods that find an analytical expression for the distribution function. Nederlof et al. (19-21) analyzed methods of several authors and found that all these methods can be interpreted as so-called local isotherm approximations. With the semianalytical methods, the distribution is obtained by taking the first and higher derivatives of the binding curve with respect to the concentration of the species that binds. Taking derivatives of experimental data without preprocessing easily leads to spurious peaks (22-24) resulting from experimental errors. Thus, also with semianalytical methods, constraints for the data processing should be specified, and this is the issue of this paper. One of the classical methods that addresses the problem of spurious peaks due to experimental error is the affinity spectrum (AS) method (25-27). The AS method includes a parameter "a" that should be able to suppress the effects of experimental errors. However, several authors have shown (20, 22-24) that spurious peaks may still result. In order to overcome this problem, Fish et al. (23) have used a cubic smoothing spline to describe the data and have applied the AS technique to the smoothed data. In principle, spurious peaks could be smoothed away by increasing the smoothing parameter, but according to the authors, there was no sound basis to decide which smoothing parameter was appropriate. They therefore concluded that the AS method was not reliable. Turner et al. (24) came to a similar conclusion. In this paper, an adapted smoothing spline method is presented that, when applied to the experimental data, removes all inflections that lie within the range of experimental error.
Figure 1. Synthetic data set. (a) Normalized distribution function used to calculate the synthetic binding curve. The dashed line indicates the range of affinities that corresponds with the data presented in panel b. For the parameters of the distribution, see Table 1. (b) The basic data (log[M] as a function of log{Mt}) on which an error with σ = 0.01 is imposed are indicated by the symbols; the solid line indicates the exact curve. (c) Amount of M bound obtained on the basis of the data of panel b (symbols); the solid line represents the exact curve. Both are normalized with respect to {St}. (d) The first derivative of the binding curve, d{Mb}/d log[M]; the solid line indicates the exact curve.
This does not mean that these inflections cannot be realistic, but that within the range of experimental error no distinction can be made between inflections resulting from experimental errors and inflections caused by functional groups. The problem of finding a proper smoothing parameter is tackled by applying the generalized cross-validation (GCV) technique (28, 29) in combination with physical constraints that should be fulfilled. Once a proper smoothing parameter is determined, it is possible to represent the data with a spline and to calculate the distribution function on the basis of that spline. In addition, from the variance in the spline, the uncertainty in the distribution function can be indicated by confidence intervals. The latter indicate whether peaks found are significant. To test and illustrate the method, the whole procedure is applied to both a synthetic and an experimental data set. Experimental data were taken from Hansen et al. (30), who measured copper binding to humic material. On the basis of the calculated distribution function, an adsorption model can be selected.
Data Sets

A synthetic data set based on a known distribution function is used to test the methods. This has the advantage that the results obtained can be compared with the true distribution. The data set measured by Hansen et al. (30) is used to illustrate the practical applicability of the methods.

Synthetic Data. The binding function underlying the synthetic data set is given by a combination of three Langmuir-Freundlich equations. For the sake of argument, we will assume that the data set describes metal ion binding to a polyfunctional ligand. The Langmuir-Freundlich equation is given by

θt = {Mb}/{St} = (K̄[M])^m / [1 + (K̄[M])^m]   (1)

where K̄ is the mean affinity of the distribution, m is a measure of the width of the distribution (0 < m < 1), [M] is the concentration of free M, {Mb} is the amount of M bound, and {St} is the total concentration of sites (both bound and free). By analyzing eq 1, Sips (31) found that eq 1 describes binding to a heterogeneous ensemble of sites and that the affinity distribution corresponding with eq 1 is a semi-Gaussian equation given by

f(log K) = [sin(mπ)/π] / [(K/K̄)^(-m) + 2 cos(mπ) + (K/K̄)^m]   (2)
Figure 1a shows the trimodal distribution for the parameters given in Table 1. By selecting a broad peak in combination with two narrow peaks, a critical evaluation of the analysis methods is possible (19-21). Using the analytical expression for the binding function, 100 data points (log[M], {Mb}) were generated, equally spaced on the log[M] scale in the range from log[M] = -9 to log[M] = -4. This means that at the highest concentration no saturation of all the sites is reached. Thus, only a part of the real distribution (indicated by the dashed line in Figure 1a) can be found with the heterogeneity analysis. It is quite common that only such a "window" of the distribution can be obtained. In fact, our synthetic data set is chosen to resemble the data set of Hansen et al. (30).

Often [M] is measured by an ion-selective electrode (ISE), and the measured points are potentials (mV) as a function of added metal ion solution. In that case, the experimental results will consist of log[M] data as a function of {Mt}, the total metal ion concentration. Let us assume that the error in {Mt} is negligible and that the absolute random error in log[M] is constant with a standard deviation of σ = 0.01, which corresponds to a relative error of 2.5% in [M]. With a random generator, this error was imposed on log[M]. The error in log[M] also shows up in the binding curve {Mb}(log[M]). Figure 1b shows the obtained "basic data", log[M] as a function of log{Mt}. The error in log[M] is hardly visible on this scale. However, when the data are replotted as a binding curve, that is, bound M ({Mb} = {Mt} - [M]) as a function of log[M] (Figure 1c), it appears that for high concentrations the experimental error becomes visible. This is due to the small difference between {Mt} and [M]. The solid line represents the exact binding curve. Values of {Mb} were normalized with respect to {St}. Thus, processing the data into a binding curve enlarges the error. An extensive study on the propagation of error is given in ref 22.

The errors become clearer when the first derivative of the binding curve with respect to log[M] is calculated (numerically); see Figure 1d. Now also for lower concentrations, deviations from the solid line become visible. The effects become worse for higher derivatives. This is illustrated in Figure 2, where the (second-order) affinity spectrum (a = 0.2) (26) is shown: large spurious peaks occur for low affinities, corresponding with high concentrations, and for high affinities a rather smooth distribution is obtained. The AS method was applied directly after transformation of the basic data to a binding curve, without further data processing. This was possible since the data were generated equidistantly on a log[M] scale. From these examples, it becomes clear that some preprocessing of the data is needed before the heterogeneity analysis methods can be applied.

Table 1. Langmuir-Freundlich Parameters of the Synthetic Data Set

peak   fraction of {St}   log K̄   m
1      0.925              3.0     0.50
2      0.050              6.0     0.95
3      0.025              8.0     0.95

{St} = 2 × 10⁻⁴ mol/L (total site concentration).

Figure 2. Affinity spectrum obtained from the synthetic data set when no preprocessing of the data is carried out. The dashed line indicates the true distribution function, corresponding to the window in Figure 1a.

Figure 3. Experimental data set of Hansen et al. (30). (a) Basic data (symbols); the solid line indicates the spline calculated for these data. (b) The binding curve calculated on the basis of the spline function and the points calculated directly from the basic data. The binding curve is normalized with respect to DOC (22.2 mg/L); the amount of metal bound is expressed in mmol/g of DOC.
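To make the error-propagation argument concrete, the following minimal sketch (ours, not the authors' code; all function and variable names are illustrative) generates the synthetic data set from the Table 1 parameters and imposes the σ = 0.01 error on log[M]. Comparing the exact and perturbed binding curves shows the error in {Mb} growing toward high concentrations, where {Mt} - [M] becomes a small difference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Table 1: fractions of {St}, log K-bar, and m for the three LF peaks
fractions = np.array([0.925, 0.050, 0.025])
logK = np.array([3.0, 6.0, 8.0])
m = np.array([0.50, 0.95, 0.95])
St = 2e-4                                      # total site concentration, mol/L

def theta_t(conc):
    """Overall degree of binding for the trimodal LF mixture (eq 1)."""
    x = (10.0 ** logK * conc[:, None]) ** m    # shape (n_points, 3)
    return np.sum(fractions * x / (1.0 + x), axis=1)

logM = np.linspace(-9.0, -4.0, 100)            # exact free concentrations
M = 10.0 ** logM
Mb = St * theta_t(M)                           # exact amount bound
Mt = Mb + M                                    # exact total concentration

# Impose a random error of sigma = 0.01 on log[M], as in the paper
logM_err = logM + rng.normal(0.0, 0.01, logM.size)
Mb_err = Mt - 10.0 ** logM_err                 # "measured" binding curve

# The relative error in {Mb} grows where {Mt} - [M] is a small difference
rel_err = np.abs(Mb_err - Mb) / Mb
print("median relative error in Mb, low  conc:", np.median(rel_err[:20]))
print("median relative error in Mb, high conc:", np.median(rel_err[-20:]))
```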
Experimental Data. The second data set consists of measured data taken from Hansen et al. (30), who studied copper binding to aquatic humic material. Filtered water samples from three coastal lagoons in Mexico were titrated with a copper solution at pH = 6 and 25 °C. The cupric ion activity was measured with an ion-selective electrode. In this way 36 data pairs, log[M] as a function of total copper added, {Mt}, were obtained (see Figure 3a). Figure 3b shows the corresponding binding function, which is normalized with respect to DOC (22.2 mg/L). When we assume that the variance of the random error in the measured electrode potential is constant and that the calibration curve of the electrode is linear over the whole measured range, then the variance of the error in log[M] is also constant, as was the case with the synthetic data set. The solid curves and the error bars in Figure 3a,b are the result of the spline analysis to be discussed later. At most concentrations the error bars coincide with the symbols.
Heterogeneity Analysis
Integral Binding Equation. The binding of a chemical species to a heterogeneous ligand with a total concentration of sites {St} is considered as the sum of the contributions of the binding to each site type. The binding of a species M to a specific site Si may be formulated as
Si + M ⇌ SiM   (3)

and when the distribution of affinities is continuous, the overall degree of binding, θt, can be described as

θt = ∫_Δ θ f(log K) d log K   (4)

where θ is called the local binding function or local isotherm, f(log K) is the distribution function, and Δ is the range of possible log K values. For the local isotherm, often a Langmuir type of equation is used. The overall degree of binding θt is equal to {Mb}/{St}. Finding the distribution function implies the inversion of eq 4. This inversion is not easy because eq 4 is a Fredholm integral equation of the first kind (32). In the semianalytical methods, several approaches can be distinguished to invert eq 4. The first group is that of the local isotherm approximation (LIA) methods (19, 20). The affinity spectrum methods (1, 26, 27) are special cases of LIA. The second type of approach is the differential equilibrium function (DEF) method (4, 34, 35). A discussion of the DEF method in relation to the LIA methods is presented in ref 21.

Local Isotherm Approximation Methods. The simplest LIA method replaces the local isotherm by a step function; this leads to the condensation approximation (CA) of the distribution function (36, 37):

f_CA = dθt/d log[M]   (5a)

log K_CA = -log[M]   (5b)
A first improvement of the CA method (19, 38) replaces the local isotherm by a linear function. Nederlof et al. (19) present the LINA distribution function as

f_LINA = dθt/d log[M] - 0.43 d²θt/d log[M]²   (6a)

log K_LINA = -log[M] - log 0.5   (6b)

The third and best approximation of the local isotherm is the logarithmic symmetrical approximation (LOGA). The expression for the LOGA distribution is (19)

f_LOGA = dθt/d log[M] - β d³θt/d log[M]³   (7a)

log K_LOGA = -log[M]   (7b)
In the literature, eq 7 has been reported with different values of β (19, 39). The best approximation of the Langmuir equation is found for β = 0.7; we will indicate the resulting distributions as LOGA-1. The LOGA distribution with β = 1 is equivalent to the second-order affinity spectrum method (a = 0) and will be called LOGA-AS. An illustration of the potential of the methods can be found in refs 19 and 21. In general, the methods work well for very wide distributions. However, narrow peaks are always flattened; this flattening is most pronounced for the CA and least for the LOGA-1.

Differential Equilibrium Function Method. In the DEF method (33, 34), the law of mass action is applied to the whole ligand system, and a differential equilibrium function, K_DEF, is defined. The expressions for the DEF distribution function and K_DEF (eq 8) are given in refs 21 and 35. An important difference between the CA or LOGA and the DEF method is that with the DEF method the log K axis is not simply obtained by transforming the log[M] scale into a log K scale by log K = -log[M]; rather, it is a weighted function of the log K values present. An illustration of the potentials and weaknesses of the DEF method is given in refs 21 and 35.

Accessibility of the Distribution Function. Although eqs 5-8 suggest that for the determination of a distribution function the total concentration of sites {St} should be known, this is not the case. When θt is replaced by {Mb}/{St}, it follows that a nonnormalized distribution function can be obtained as

F(log K) = {St} f(log K)   (9)
where F(log K) is the nonnormalized distribution function; for example, for the CA method, F_CA = d{Mb}/d log[M]. The shape of the distribution function is not affected by the fact that the distribution is not normalized. In theory, the quality of the distribution function obtained depends on the capabilities of a given heterogeneity analysis method (21). However, in practice the accuracy of the experimental data is also extremely important. In order to obtain reliable results, some smoothing of the data will be required. An optimal procedure for the data treatment should indicate whether it is justified to use higher order derivatives; the LIA estimates can then be computed by numerical differentiation, as sketched below. In principle, a smoothing spline technique is well suited to do this. To determine the appropriate amount of smoothing, a statistical optimization procedure in combination with physical constraints will be used. Although this leads to a somewhat complicated procedure, the great advantage is that the smoothing is standardized.
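As an illustration of eqs 5-7, here is a minimal sketch (ours, not the authors' code) of the LIA estimates obtained by numerical differentiation of a smoothed binding curve. The coefficient of the third derivative in eq 7a follows the reconstruction given above and should be checked against ref 19 before serious use.

```python
import numpy as np

def lia_distributions(log_m, theta, beta=0.7):
    """CA, LINA, and LOGA estimates from a smooth binding curve theta(log[M])."""
    d1 = np.gradient(theta, log_m)     # d(theta)/d log[M]
    d2 = np.gradient(d1, log_m)        # second derivative
    d3 = np.gradient(d2, log_m)        # third derivative
    f_ca = d1                          # eq 5a, plotted at log K = -log[M]
    f_lina = d1 - 0.43 * d2            # eq 6a, at log K = -log[M] - log 0.5
    f_loga = d1 - beta * d3            # eq 7a, at log K = -log[M]
    return f_ca, f_lina, f_loga
```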
Smoothing Spline Technique

Principle of Smoothing Spline. The smoothing spline method is well suited to describe the data because no a priori assumptions are made about the functionality underlying the data (28, 40, 41). An ordinary spline connects the data points themselves; that is, all the data points lie on the spline function. For noisy data points (xi, yi) this is not realistic, since there is a range of uncertainty around them. A smoothing spline takes this into account by fixing only the xi values, whereas the yi values are assumed to contain a normally distributed random error. It is assumed that the xi data represent the independent variable, in our case {Mt}. The degree of smoothing is determined by the smoothing parameter, p, which regulates the trade-off between the goodness of fit and the smoothness of the curve.
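For readers who want to experiment, SciPy offers a smoothing spline whose penalty parameter plays the role of p. This is a hedged sketch, not the authors' implementation: SciPy's make_smoothing_spline (available in SciPy 1.10+) is a cubic spline with a second-derivative penalty rather than the quintic spline used in the paper, so its third derivatives are less well behaved.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

def fit_basic_data(log_mt, log_m, lam=None):
    """Fit the basic data log[M](log{Mt}); lam=None selects the penalty by GCV
    (the criterion discussed in the next subsection)."""
    log_mt = np.asarray(log_mt)
    log_m = np.asarray(log_m)
    order = np.argsort(log_mt)                 # abscissae must be increasing
    spl = make_smoothing_spline(log_mt[order], log_m[order], lam=lam)
    return spl                                  # a BSpline; call spl(x) to evaluate
```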
For a given smoothing parameter, p, the spline function, indicated by s_p(x), is obtained by minimizing the function C_p (29, 42):

C_p = Σᵢ₌₁ⁿ wᵢ [yᵢ(xᵢ) - s_p(xᵢ)]² + p ∫_Δ [s_p⁽ᵛ⁾(x)]² dx   (10)

where n is the number of data points, wᵢ is a weighting factor, yᵢ(xᵢ) are the data points, s_p⁽ᵛ⁾(x) is the νth derivative of the spline function, and Δ represents the x-range. The goodness of fit is calculated as the weighted sum of squares of the differences between the smoothing spline and the measured values yᵢ(xᵢ). The integral in eq 10 is the smoothness of a νth-order spline. The weighting factors differ from wᵢ = 1 when the variances of the individual points are not the same. For the present data sets, all weighting factors can be set equal to unity. It follows from eq 10 that the higher the smoothing parameter p, the more weight is given to the smoothness of the curve; the lower the smoothing parameter, the more weight is given to the goodness of fit. In general, a small experimental error corresponds to a low smoothing parameter value and a large experimental error to a large one. The order of the spline may have a slight effect on the result. We have chosen a third-order (ν = 3), or quintic, spline. The third derivative of this type of spline, required for a LOGA distribution, is a continuous function.

Ideally, the smoothing should be large enough to ensure that fluctuations due to experimental error are smoothed away, but small enough to conserve inflections that are due to existing heterogeneity. However, when the latter inflections are only as large as the deviations resulting from experimental errors, no method can discriminate between the two, and the inflections will be smoothed away. This general understanding is, however, not enough to determine the value of the smoothing parameter.

Statistical Optimization of the Smoothing Parameter. For a given smoothing parameter, two quantities related to the goodness of fit and the smoothness of the function can be calculated. The goodness of fit is expressed by the mean squared residual, R_p (28):

R_p = (1/n) Σᵢ₌₁ⁿ wᵢ [yᵢ(xᵢ) - s_p(xᵢ)]²   (11)

This quantity is equal to the goodness of fit criterion in eq 10 divided by n, the number of data points. For low values of log p, R_p is very low because the spline follows the data closely. For large values of log p, R_p sharply increases because the spline starts to deviate from the original data. The smoothness of the function is controlled by the quantity Q_p (42):

Q_p = 1 - (1/n) trace(A_p)   (12)

where the matrix A_p maps the original data points yᵢ to the corresponding spline estimates s_p(xᵢ). Trace(A_p) is a measure of how closely the original data are followed. When trace(A_p) is close to n, the data points are followed very closely; when it is close to zero, a smooth function will be obtained (28). Thus, a smooth function corresponds with a high value of Q_p, and a spline that closely follows the data points implies a small value of Q_p. To determine the smoothing parameter, a function of R_p and Q_p has to be minimized. The type of function which is minimized depends on whether the error in the data is known or unknown.

Known Error. When the variance σ² in the data is known, the proper smoothing parameter is found by minimizing the true predicted mean squared error, defined as (29):

T_p = R_p + σ²(1 - 2Q_p)   (13)

Minimizing T_p as a function of p means that R_p is kept as low as possible to ensure a good fit and Q_p is kept as high as possible to ensure a smooth function. Figure 4a shows the log T_p function for the synthetic data set (all wᵢ equal to unity). A clear minimum in the function is found at log p = -5.6.

Unknown Error. For unknown error, Craven and Wahba (29) showed that a suitable criterion to determine a proper smoothing parameter is provided by the generalized cross-validation (GCV) function. Cross-validation is a statistical method of optimizing p. The basic idea of cross-validation is to leave the data points out one at a time and then to select the smoothing parameter that best predicts the missing data points. The GCV criterion can be written in terms of R_p and Q_p (28, 29):

GCV_p = R_p/Q_p²   (14)

The GCV function can be seen as a balance between R_p and Q_p²: R_p increases with p, whereas Q_p decreases with p. The log GCV function for the synthetic data set (all wᵢ equal to unity) is shown in Figure 4b.

Figure 4. Quantities obtained from the spline function analysis calculated for the synthetic data set with an error of σ = 0.01, as a function of the smoothing parameter. (a) Minimizing function for known error: log T_p(log p); (b) minimizing function for unknown error: log GCV_p(log p).
The minimum is found at log p = -5.8, which is close to the value found using the known error (log p = -5.6). Several authors report that the GCV estimate for the smoothing parameter may deviate from the proper value (see, e.g., ref 43). In our case, we found that the GCV criterion results in either the correct value or a too low value for the smoothing parameter. It is therefore necessary to add information to find the proper value of the smoothing parameter. We use physical constraints for this purpose. In principle, the constraints can be built into the spline algorithm (44). However, no flexible algorithm is available yet for a combined use of external constraints and the GCV criterion. Therefore, the physical constraints are tested separately.

Physical Constraints To Determine the Smoothing Parameter. Properties of the overall binding function can also be used to decide whether peaks in a calculated distribution are the result of an underlying heterogeneity or of experimental error. These properties will be called "physical constraints" because they are independent of the experimental conditions, the type of distribution, and the magnitude of the experimental error (20). The role of the constraints is that, when a constraint is not fulfilled, there are apparently too many inflections left in the binding curve that should be smoothed away, and thus the smoothing parameter should be increased until the constraint is satisfied. According to the local isotherm, the occupancy of a given type of sites increases with the concentration of the species; hence, the sum of the occupancies of all site types also increases with the concentration. This means that the first derivative of the binding curve with respect to log[M] should always be positive. In terms of the experimentally accessible quantities [M] and {Mt}:

d{Mt}/d log[M] - 2.3[M] ≥ 0   (15)
This constraint is generally valid and is equivalent to demanding that the CA distribution should always be positive. Further constraints depend on the function used as local isotherm. In the case of the Langmuir isotherm, which is a convex function when plotted on a linear [M] scale, the sum of the individual contributions is also a convex function. Thus, the second derivative of the binding curve with respect to [M] should always be negative. In terms of {Mt} and [M], this constraint can be given as

d²{Mt}/d[M]² ≤ 0   (16)
In most cases this is equivalent to demanding that the LINA distribution is positive. From the Langmuir functionality, it also follows that the LOGA-AS distribution should always be positive (17). This leads to the third and most severe constraint:

d{Mb}/d log[M] - d³{Mb}/d log[M]³ ≥ 0   (17)
The third constraint has the disadvantage that it cannot be applied directly to the basic data. In order to apply eq 17, the basic data have to be converted to a binding curve for each adaptation step of the smoothing parameter, which is computationally inefficient.
The choice of a certain constraint can be coupled to the heterogeneity analysis method to be used. For the CA method, the first constraint should be fulfilled; for the LINA, the first and second; and for LOGA and DEF, all three constraints should be obeyed. For a "Gamble plot", i.e., K_DEF({Mb}) (33), the Langmuir functionality demands that this plot should always decrease with increasing {Mb} (or [M]). This means that the DEF distribution should always be positive. Since in practice the DEF distribution easily becomes negative for high-affinity peaks when noisy data are used, using this property to find a smoothing parameter might lead to oversmoothing, and it will not be used as a constraint. The three constraints can be checked numerically on the fitted curve, as sketched below.
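A minimal numerical check of the three constraints on a candidate fit, assuming the smoothed binding curve {Mb} has been evaluated on a log[M] grid; the grid, the names, and the simple loop strategy in the trailing comment are ours, not the paper's algorithm. The eq 17 check inherits the reconstructed form of eq 7a given above.

```python
import numpy as np

def constraints_satisfied(log_m, mb):
    """Check eqs 15-17 on a smoothed binding curve {Mb}(log[M])."""
    d1 = np.gradient(mb, log_m)                      # d{Mb}/d log[M]
    c1 = np.all(d1 >= 0)                             # eq 15: CA density positive
    m_lin = 10.0 ** log_m                            # linear concentration scale
    d2_lin = np.gradient(np.gradient(mb, m_lin), m_lin)
    c2 = np.all(d2_lin <= 0)                         # eq 16: convex on linear [M]
    d3 = np.gradient(np.gradient(d1, log_m), log_m)
    c3 = np.all(d1 - d3 >= 0)                        # eq 17: LOGA-AS positive
    return c1 and c2 and c3

# Increase the smoothing parameter until the constraints hold, e.g.:
#   while not constraints_satisfied(grid, spline_for(p)(grid)):
#       p *= 10 ** 0.1
```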
Variance of the Distribution Function

When a proper estimate of the smoothing parameter is obtained, an estimate can be made of the variance, σ_p², of the spline function (28, 42):

σ_p² = R_p/Q_p   (18)
σ_p² can be used in a variance analysis of the spline. The value of σ_p² can also be compared with an estimate of the experimental variance σ². Using the variance in y (either known experimentally or estimated by eq 18), it is possible to calculate the covariance matrix of the spline coefficients. For linear functionals of the spline s_p(x), the variance of the function can be calculated directly from the variance in the spline function. Since we assume here a nonlinear relationship between the spline and the desired quantity, the variances have to be calculated by Monte-Carlo simulation. For this purpose, we used 100 realizations of the spline; these splines reflect the uncertainty in the data set. On the basis of each of these splines, the distribution function can be calculated. In this way, 100 realizations of the distribution function are obtained, which allow us to compute confidence intervals.
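A rough sketch of the Monte-Carlo step follows. The paper perturbs the spline through the covariance matrix of its coefficients; for brevity, this sketch simply refits data perturbed within the estimated variance, a simpler stand-in with the same intent. It assumes the basic data have already been converted to a binding curve ({Mb} versus log[M]), and all names are illustrative.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

def mc_intervals(log_m, mb, sigma_p, lam, grid, n_real=100, seed=1):
    """Spread of the nonnormalized CA distribution over noise realizations."""
    rng = np.random.default_rng(seed)
    f = np.empty((n_real, grid.size))
    for i in range(n_real):
        mb_i = mb + rng.normal(0.0, sigma_p, mb.size)  # perturbed realization
        spl = make_smoothing_spline(log_m, mb_i, lam=lam)
        f[i] = spl.derivative(1)(grid)                 # F_CA = d{Mb}/d log[M]
    return f.mean(axis=0), f.std(axis=0)               # curve and +/- 1 sd band
```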
Calculation of the Distribution Function
The heterogeneity analysis is applied both to the synthetic data with the imposed artificial error and to the experimental binding data. The smoothing parameter is determined using the GCV smoothing spline technique in combination with the physical constraints. From the resulting spline, 250 data points are generated, equidistantly spaced on the log[M] axis. The CA, LINA, LOGA-AS, LOGA-1, and DEF methods are applied to these calculated spline data to obtain the distribution functions by numerical differentiation. Finally, confidence intervals are calculated using the Monte-Carlo simulations.

Synthetic Data. From the spline procedure for known error, a smoothing parameter of log p = -5.6 was obtained. The GCV estimate was log p = -5.8. The splines based on these two values of the smoothing parameter obey the first and second physical constraints. In order to fulfill the third constraint, the smoothing parameter has to be raised to log p = -5.0. The resulting distributions for log p = -5.0 are shown in Figures 5a-d and 6. Each distribution is presented by a solid curve.
Figure 5. Distribution functions obtained on the basis of the spline function calculated for the synthetic data set with a random error of σ = 0.01. The dashed line represents the true distribution function. The confidence interval is indicated by bars. (a) CA distribution; (b) LINA distribution; (c) LOGA-AS distribution; (d) LOGA-1 distribution.

Figure 6. DEF distribution function obtained on the basis of the spline function calculated for the synthetic data set with a random error of σ = 0.01. The confidence interval is indicated by bars (solid curve). The dashed curve represents the true distribution function.
The error bars indicate the uncertainty in the distribution and range from plus one standard deviation to minus one standard deviation. These confidence intervals do not mean that the true distribution lies within this interval, but that for other realizations of the random experimental error a result is found that lies within the indicated range with a probability of 66%.

The CA distribution (Figure 5a) is rather smooth and featureless. The two high-affinity peaks are reproduced very weakly. The variance in the result is rather low; only for affinity values lower than log K = 4.5 is the variance distinct. The LINA distribution (Figure 5b) is marginally better than the CA distribution. The variance has increased with respect to the CA result, especially for low affinities.
The LOGA-AS and LOGA-1 distributions (Figure 5c,d) are strongly flattened but indicate the presence of both high-affinity peaks. However, the variances have strongly increased as compared to CA and LINA, due to the third derivative involved. For affinity values below log K = 5, the LOGA-AS and LOGA-1 results are no longer reliable, as can be seen from the error bars. The DEF distribution (Figure 6) reproduces the peak around log K = 8, but the height of the peak is not defined since the variance in the result is very large. Between log K = 5 and log K = 8, the variances are reasonably low, slightly worse than the LINA result. Except for the high-affinity peak, the shape of the DEF distribution is similar to the LINA result but shifted to higher affinities. For low affinities, the variance in log K_DEF is considerable (horizontal error bars).

Experimental Data. We start by applying the smoothing spline procedure to the data set of Hansen et al. (30). This is done for the basic data, i.e., log[M] versus log{Mt}. In this way the error is the same in all the data points, and the weighting factors may be set to 1. For {Mb} data versus log[M], different weighting factors would have to be used, which leads to a more complicated situation. From the GCV plot (Figure 7), it follows that a distinct GCV minimum is found at log p = -4.6. Next, the spline obtained for log p = -4.6 is checked with respect to the physical constraints. Only the first constraint is then obeyed; the other two constraints are obeyed after increasing the smoothing parameter to log p = -4.3. This value of log p is used for the further analysis. The resulting spline and the binding curve calculated from this spline are shown in Figure 3a,b, respectively (solid lines).
Figure 7. GCV plot for the experimental data set. A minimum was found at log p = -4.6.
From the spline statistics for log p = -4.3, an estimate of the error variance is obtained of σ² = 2.2 × 10⁻⁴. This means a relative error in [M] of approximately 3.5%, which seems reasonable for an ISE experiment. The variances are indicated with error bars. The resulting CA distribution (Figure 8a) is a smooth, exponentially shaped curve with low densities for high affinities and an increasing density for low affinities. This indicates that the experimental window only allows the calculation of the high-affinity part of the distribution. The error bars indicate that the result is significant over almost the whole accessible log K range. The LINA distribution (Figure 8b) shows more or less the same picture, but for affinity values lower than log K = 6 the result is no longer reliable. The LOGA-AS distribution (Figure 8c) exhibits, on top of the main exponentially shaped distribution, a small high-affinity peak at log K = 8.5. The distribution for log K < 5.5 is not significant because the variance is very large in that region. With the LOGA-1 (Figure 8d), a similar picture emerges: the high-affinity peak at log K = 8.5 is more distinct; however, the variance is also larger. Below log K = 6, the distribution is no longer reliable. The DEF distribution (Figure 9a) shows large uncertainties at both ends of the distribution function. As for the LOGA method, the DEF distribution is not reliable for affinities lower than log K = 6; in that region the variance in log K_DEF is also considerable. An unexpected phenomenon occurs at the high-affinity end of the distribution: negative values of the distribution are found at log K values in the range 7.2 < log K_DEF < 8.1. To investigate this effect, log K_DEF is plotted as a function of log{Mt} in Figure 9b. With decreasing concentration (starting at log{Mt} = -4.5), log K_DEF first increases as expected, but after a maximum at about log{Mt} = -5.8 and log K_DEF = 8.1, the values of log K_DEF decrease again, which is physically impossible. The variance in log K_DEF is also considerable for low values of {Mt}. These values are responsible for the anomalous behavior of the calculated distribution function. This behavior is probably caused by a slightly nonlinear calibration curve of the copper electrode at low concentrations.
Selection of a Binding Model

The calculated distribution function can be used to select a binding model. The CA distribution obtained with the Hansen data shows a relatively wide and smooth distribution, indicating that a Langmuir-Freundlich (LF) equation (see eq 1) may be used to describe the binding data. In Figure 10a, the best possible result of applying eq 1 is shown. Although the result is satisfactory for higher concentrations, the data systematically deviate from the calculated curve in the range -8 < log[M] < -6. From Figure 10a and according to the LOGA distribution, it is expected that an additional high-affinity site will result in an improvement of the description. The best results are obtained for a site at log K = 8.1 with a density of 1.5 × 10⁻⁵ mol/L in combination with a main peak with m = 0.48, log K̄ = 3.04, and {St} = 2.1 × 10⁻⁴ mol/L (see Figure 10b). The calculated curve agrees within experimental error with the measured data. A further illustration of the method can be found in ref 8, where proton binding to humics is studied. There the analysis indicates that the distribution is composed of two broad peaks. Consequently, to describe the data, a combination of two Langmuir-Freundlich equations was required.
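The model-selection step can be reproduced with an ordinary least-squares fit. This sketch (ours; the function name and starting values are illustrative, taken loosely from the distribution analysis above, not from the paper) fits eq 1 plus one discrete high-affinity Langmuir site:

```python
import numpy as np
from scipy.optimize import curve_fit

def lf_plus_site(log_m, st, log_k, m, s_hi, log_k_hi):
    """Eq 1 (Langmuir-Freundlich) plus one discrete Langmuir site."""
    conc = 10.0 ** log_m
    x = (10.0 ** log_k * conc) ** m
    y = 10.0 ** log_k_hi * conc
    return st * x / (1.0 + x) + s_hi * y / (1.0 + y)

# Rough starting values: main peak near log K = 3, minor site near log K = 8
p0 = [2e-4, 3.0, 0.5, 1e-5, 8.0]
# popt, pcov = curve_fit(lf_plus_site, log_m_data, mb_data, p0=p0)
```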
Discussion

Evaluation of Methods. In a previous paper, we compared the LIA methods with the DEF method on the basis of their derivation and the results obtained for accurate synthetic data (21). In the present paper, the problems of dealing with data containing an experimental error are analyzed. Of the LIA methods, the CA method is the simplest and most stable method with noisy data. The LINA method gives hardly better results than the CA, but the variance in the result is larger; in practice, the CA method is therefore to be preferred over the LINA method. The LOGA method gives in principle the best results but is restricted to high-quality data. For poor data, details within the range of experimental error are smoothed away, and the LOGA distribution will resemble the CA distribution. The DEF method is suited to indicate a "minor sites affinity peak" at the high-affinity end of a distribution (35). Other peaks are less well reproduced. The DEF method is the least suitable to determine the shape and position of the background distribution. This is confirmed on theoretical grounds in ref 21. Also for the DEF method, high-quality data are needed; otherwise, large oscillations will appear at both the high- and low-affinity range.

Resolving Power. The analysis of the Hansen data shows the importance of the resolving power of the heterogeneity analysis methods. With the LOGA and DEF methods, a high-affinity peak was found. Although the peak represents only a small fraction of the total (estimated) sites, these sites play a major role at low concentrations, since they are occupied preferentially. This has special significance for practical problems in the environment, where heavy metals are often present in trace amounts. Under these conditions, only the "strongest" sites will be occupied by metal ions. The high-affinity sites are occupied at low concentrations. Since {Mb} is found as {Mt} - [M], the absolute error in {Mb} is equal to that in [M], assuming that {Mt} is error free. Since the relative error in [M] is constant, the absolute error in [M] at low concentrations, and thus in {Mb}, is very small. This has the consequence that, in principle, a (small) high-affinity peak can be discerned better than a low-affinity peak. This is especially the case when a relatively large concentration of sites is present.
Figure 8. Distribution functions for the experimental data set based on the spline function (log p = -4.3). The symbols indicate the position of the original data on the log K axis. The confidence interval is indicated by the bars. Results are normalized with respect to DOC. (a) CA distribution; (b) LINA distribution; (c) LOGA-AS distribution; (d) LOGA-1 distribution.
Low affinities correspond with high concentrations of [M]; in that case the difference {Mb} = {Mt} - [M] is relatively small, and the absolute error in [M] is relatively large, so that the value of {Mb} is rather inaccurate. Hence, for a given accuracy in log[M], the resolution of the distribution function decreases with increasing log[M]: the lower the log K value of a peak, the harder it is to resolve. Some ways to minimize the experimental errors are discussed in ref 22.

Besides the effect of the experimental error, there is a clear effect of the background distribution. For instance, in the calculated distributions based on the synthetic data set, the peak at log K = 6 is smeared out over the exponential course of the distribution, whereas the peak at log K = 8 is clearly recovered. This is also true in general: a peak situated at the tail of a distribution can be clearly observed, whereas a sharp peak placed on a steep part of a background distribution is difficult to find. A lucky coincidence is that a sharp peak somewhere in the middle of a broad distribution hardly affects the binding characteristics.

Optimal Experimental Data Set. In this paper, "basic data" were smoothed, that is, log[M] as a function of log{Mt}. Often data are presented in the literature as {Mb} as a function of log[M]. This means that if only these data are available, smoothing has to be performed on this transformed data set. As mentioned above, the variance of the error in {Mb} may vary strongly with [M]. This implies that the relative accuracy of the data should be individually accounted for by weighting factors, so that the larger the variance, the less weight is given to the data point. Often these variances are difficult to quantify; Silverman (28) describes an approach in which the weighting factors are
determined iteratively. Model calculations on our synthetic data set show that for a relatively good data set, that is, with sufficient data points (about 100) and a not too large error (