Environ. Sci. Technol. 2008, 42, 5210–5216
Quantitative Structure-Property Relationships for Predicting Metal Binding by Organic Ligands STEPHEN E. CABANISS* Department of Chemistry, University of New Mexico, Albuquerque, New Mexico 87131
Received April 23, 2008. Revised manuscript received December 4, 2007. Accepted March 27, 2008.
Quantitative structure-property relationships (QSPRs) are developed to predict the complexation of Al(III), Ca(II), Cd(II), Cu(II), Ni(II), Pb(II), and Zn(II) by organic ligands containing carboxylate, phenol, amine, ether, and alcohol functional groups. These QSPRs predict conditional stability constants (KM′ at pH 7.0 and I ) 0.1) over a range of ligand types with consistent uncertainties of ∼1 log unit without requiring any steric or connectivity information. Calibration and validation data sets were constructed using 1:1 complex formation constants from the NIST Critical Stability Constants database (version 8.0). The descriptor variables are intuitive quantities conceptually related to metal binding, such as the numbers of various ligand groups, charge density, etc. The resulting calibrations have r2 ) 0.87 to 0.93 and Spred ) 0.67 to 1.05 log units, with positive values for all ligand count descriptor variables. The QSPRs account for 75-95% of the variability in the validation data set with RMSE of 0.74 to 1.30 log units. These QSPRs improve upon previous work by providing a tested and mechanistically reasonable method for log KM′ prediction with uncertainties comparable to or better than other QSPRs calibrated with groups of diverse ligands.
Introduction Complexation of metal ions by organic ligands regulates metal reactivity, bioavailability, and transport in a variety of geochemical and environmental systems. For example, transport of radionuclides through the subsurface can be greatly enhanced in the presence of natural organic complexing agents (1, 2), while in surface waters organic complexation regulates the bioavailability, and hence the toxicity, of Cu(II) (3, 4). Environmental organic complexing agents include both identifiable molecules (5) and the heterogeneous mixture designated natural organic matter (NOM) or humic substances (6). While computational methods for calculating equilibrium concentrations are widely available (7), thermodynamic constants for complexation reactions of interest are less so. Dimmock and co-workers (8) have reviewed a number of methods for estimating stability constants K in the absence of experimental data, and Hancock (9) has commented on some of the shortcomings of these. In general, the most reliable approaches are based on linear free energy relationships (LFERs). It has been known for some time that stability constants for different metals with the same group of ligands are often correlated (10). * Corresponding author e-mail:
[email protected]. 5210
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 42, NO. 14, 2008
However, it is often desirable to estimate stability constants without the benefit of experimental data, for example in cases of ligand design or predicting the consequences of hypothetical geochemical ligands. In these cases it is necessary to use quantitative structure-property relationships (QSPRs) which estimate stability constants from only structural information. Input variables can be obtained from molecular modeling, graph theory, or from theoretical or ‘intuitive’ considerations (8, 9), while mathematical formalisms include linear and nonlinear equations with parameters estimated by various statistical methods (11). Stability constant QSPRs based upon molecular modeling of strain energies have been reported with very good predictive power within groups of similar molecules (12). Hay and co-workers (13) used molecular mechanics to evaluate ring strain in potassium-crown ether complexes, and the resulting QSPR had a root-mean-square error (RMSE) of only 0.2 log units. Comba and co-workers (14) used a similar QSPR to predict lanthanide binding by three organophosphate ligands and also obtained good linearity of predicted versus experimental log K. On the other hand, estimation of complexation energies using quantum methods has been less successful (15). Topological fragment-based approaches have been used on series of related molecules with reasonable success. Solov’ev and co-workers (16) developed a series of QSPR’s which predicted K(I) binding by phosphoryl-containing podands with standard deviation of prediction (Spred) between 0.32 and 0.52, which Solov’ev and Varnek (17) later improved to obtain r2 values between 0.90 and 0.95 and Spred of ∼0.3. However, when this approach was expanded to a more diverse set of ligands the Spred increased to >1.4 log units (11). A different graph-based approach fit complex formation enthalpies with Spred of 20-30 kJ/mol, but this corresponds to Spred of 4-5 log units (18). Recently Svetlinski and co-workers (19) used semiempirical molecular modeling calculations as descriptor variables in an ambitious attempt to predict stability constants for 14 lanthanide ions with 63 structurally diverse ligands. The resulting QSPRs for each metal had r2 in the range 0.85 to 0.91 and Spred in the range 1.5 to 2.2 log units. The more intuitive Hancock equations for carboxylate and amine ligands use separate terms for individual ligand groups and for the ‘chelate effect’ (20, 21). Harris expanded upon that approach to include different ligand groups and to explicitly account for different chelate geometries (22), using one coefficient for each type of ligand atom and additional terms to represent ring formation. Working with Ni(II), Zn(II), Cd(II) (∼90 ligands each), and Fe(III) (35 ligands), Harris developed QSPRs which were combined with an additional linear regression to obtain Spred of 0.70 to 0.74 for the divalent metals and 1.39 for Fe(III). However, it should be noted that the data in this study were sifted specifically to exclude ligands with large steric or inductive effects which might affect complex formation. Currently available QSPRs are not satisfactory for environmental and geochemical problems. First, many of the QSPRs are calibrated only for very restricted sets of ligands and cannot plausibly be applied to the wide variety of ligands present in soils and waters. Second, the predictive uncertainty of these QSPRs is often poorly characterized because they have not been tested against validation data independent of the calibration data (9). Third, many QSPRs rely upon exact structural information which is often unavailable for geochemical ligands; geochemical analyses of soil and waters 10.1021/es7022219 CCC: $40.75
2008 American Chemical Society
Published on Web 06/14/2008
often provide functional group and elemental composition data but not exact stereochemistry or even overall connectivity. This paper develops and evaluates QSPRs for predicting stability constants of environmentally and geochemically interesting metals with small molecules similar in ligand group composition to NOM. The chosen approach uses intuitive descriptor variables representing numbers of specific ligand groups and molecular properties plausibly related to complex stability and does not rely on stereochemistry or connectivity information beyond functional group composition. Work in progress will combine these QSPRs with dynamic structure models of geochemical ligands to predict metal complexation in natural systems (23, 24).
Thermodynamic Data Sets Two thermodynamic data sets are used in this work, a calibration data set used to obtain the QSPR parameters (Table S1, Supporting Information (SI)) and a validation data set used to test the predictive ability of the resulting equations (Table S2). Thermodynamic stability constants were obtained from the NIST Critically Selected Database of Stability Constants, version 8.0 (25), an annotated compilation of results from many different sources. Ligand molecules for both data sets were selected using the following criteria: (a) the presence of one or more of the functional groups: carboxylic acid, amine, and phenol, (b) absence of P, S, and halogen atoms, (c) availability of 1:1 binding constants for the metals of interest (Al(III), Ca(II), Cd(II), Cu(II), Ni(II), Pb(II), or Zn(II)), and (d) minimizing numbers of molecules with highly similar binding sites (e.g., the calibration data set includes acetic acid but not the homologues propanoic acid, butanoic acid, etc.). For each 1:1 complex, a constant determined at T ) 25 °C and ionic strength 0.1 was selected where available; otherwise a constant determined at similar conditions was selected. All constants selected had been determined at temperatures ranging from 20 to 30 °C, and no attempt was made to correct for temperature differences. Ionic strength values ranged from I ) 0 to 2.0, and the Guntelberg equation was used to estimate the constant at I ) 0.1 as needed. (No corrections were performed on constants determined at I > 1, since those corrections are susceptible to large relative errors). The calibration data set contains 78 distinct ligand molecules with 377 1:1 binding constants, while the validation data set contains 26 ligand molecules with 130 binding constants. The intention in selecting molecules for the validation data set was to obtain a similar group of ligands in terms of overall size and ligand groups, but to have different carbon skeletons and steric arrangements. For example, the calibration data set contains malonic acid, while the validation data set has diethylmalonic acid; pKa2 values for these molecules at I ) 0.1 are 5.27 and 6.96, respectively. In this study, the predicted complexation constant is a conditional constant at pH 7.0 and ionic strength I ) 0.1, expressed as M + L ′ a ML
KM ′ )
[ML] [M][L ′ ]
(1)
where [M] and [ML] are molar concentrations of the uncomplexed and complexed metal and [L′] is the sum of the concentrations of all ligand species which are not complexed by a metal. That is, [L′] is the sum of the free ligand concentration [L-m] and the various protonated ligand species concentrations [HL1-m], H2L2-m], etc. Using this form of the stability constant has the advantage that it can be calculated even for phenolic and amine ligands where the
pKa values are large and uncertain, as long as the exchange constant KM,ex is known Mn++ HmL a ML(n-m)+ + mH+
KM,ex )
[ML][H+]m [Mn+][HmL] (2)
In this case the correction is straightforward, KM ′ ) [H+]-mKM,ex
(3)
assuming that [HmL] is the predominant (>99%) protonation state at pH 7.0 (this is the case for the phenolic and amine compounds for which the exchange constants are provided in the NIST database (25)). For complex formation constants written in terms of a deprotonated ligand LmMn+ + Lm- a MLn-m
KM,f )
[MLn-m] [Mn+][Lm-]
(4)
the correction uses the R formalism where Rm is the fraction of the uncomplexed ligand which exists in the -m charge state and KM ′ ) RmKM,f’
(5)
In general, corrections for polycarboxylic acids are small, but corrections for amine and phenol ligands can be quite large, leading to KM′ values below 1 (log KM′ < 0). The same equations can be rearranged to adjust for other pH values, assuming that the pH and all relevant acidity constants are known (26). The calibration data set contains 78 ligand molecules but no more than 70 log KM′ values for any single metal. This reflects the variety of studies included in the NIST database, including various analytical methods and purposes. Some ligands have KM′ reported for only a single metal ion, while others have values for all seven. It should be noted that some ligands probably form complexes which are too weak to be reliably detected. Details on the calibration data set are in the Supporting Information.
Molecule Descriptors The QSPRs presented here are based upon the general linear equation log KM ′ )
∑ax
(6)
i i
i
where xi are the descriptor variables and ai are the coefficients of those variables (to be determined by regression). In the absence of a good theoretical justification for a constant term (intercept) in the QSPRs, this term was set to zero. Consequently each QSPR will have only one adjustable coefficient ai for each descriptor variable. Two types of descriptor variables were used in the calibration process, group count variables and multigroup variables. The six group count variables were slightly modified counts of ligand groups within a molecule: #COOH, #Amines, #Amides, #Alcohols (alkyl), #Phenols, and #Ethers (23, 24). For some molecules in the calibration database, steric considerations prevented coordination by multiple groups, e.g., by three phenols on the same aromatic ring. For example, 1,2,3-trihydroxybenzene has #Phenol ) 2. In addition, very few molecules had more than two of certain functional groups and no molecule had >4 carboxylic acids. Consequently, no molecule has a group count of #Phenols, #Alcohols, and #Amides greater than 2, and the maximum value of #COOH was 4. VOL. 42, NO. 14, 2008 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
5211
The five multigroup variables were somewhat more complex, each using a unique formula: The square of the ligand number (LN2) partially accounts for the “chelate effect”; log K of complexation for a chelating ligand is larger than the sum of the log Ks for monocoordinating ligands. This descriptor was calculated as LN2 ) (#COOH + #Amine + #Phenol)2
(7)
for each molecule for metals with significant binding by phenols (Al(III) and Cu(II)) and as LN2 ) (#COOH + #Amine)2
(8)
for other metals (the number of phenol containing ligands in the calibration data for these other metals was not large enough to justify their use). For the calibration database the sum in parentheses never exceeded 6, the maximum coordination number for these metals. The charge density descriptor (Zdens) was calculated for pH 7.0 assuming that all carboxylic acids but no phenols or amines were deprotonated, resulting in charge Z ) (#COOH - #Amines) and Zdens )
Z MW
(9)
where MW is the molecular weight. The third multigroup descriptor, the heteroatom-tocarbon ratio, Het:C, was used in a previous QSPR for Cu(II) (24). It is calculated from elemental composition as Het : C )
#O + # N #C
(10)
where #X refers to the number of atoms of element X in the molecule. For several metals, molecules with both amine and carboxylic acid ligands behave somewhat differently than molecules with only one type. The effect appeared to be largest for smaller molecules, and so an amino acid variable AA was calculated as AA ) 1 ⁄ MW
(11)
if both groups were present on a molecule; otherwise AA was set to 0. Finally, a variable to represent the probability of carboxyl and phenol groups being close enough to each other to form a chelating complex was added. This clustering variable was calculated as clustering ) (#COOH + #Phenol) ⁄ MW2
(12)
where the dependence on 1/MW2 represents the probability of two ligand groups being nearby. Only the first three multigroup variables (LN2, ZD, Het:C) were initially tested; the latter two were added after initial regression results revealed systematic error patterns related to amino acids and group densities. However, these variables were subsequently tested just as the first three.
Calibration of QSPRs The data set used here is subject to uncertainties and errors due to different temperatures, background electrolytes, and analytical methodologies. Reported analytical uncertainties in the NIST database are typically 0.05 to 0.1 log KM′ units (25). Uncertainties due to temperature are probably similar in size. Variability due to background electrolyte can arise from either (a) incorrect extrapolation from experimental conditions, which may be up to 0.5 log units in some cases depending on ligand charge and the ionic strength and (b) differences between ‘inert’ electrolytes. For example Ca-EDTA 5212
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 42, NO. 14, 2008
KM′ values differ by 0.16 log units between K+ and tetraalkyl ammonium ‘inert’ salts at 25 °C and ionic strength 0.1; for Al-citrate, the difference is 0.31 log units (25)). In addition, methodological choices may lead to errors which are much larger than these uncertainties; for example, the complex of Al(III) with D-tartrate has log KM′ of 3.39 in the NIST database while Nordstrom and May (27) prefer a value of 5.32, well outside the range of experimental uncertainties. Given this array of uncertainties and variations in the data, it is unreasonable to expect a predictive uncertainty of less than ∼0.3 log units. At the other extreme, the largest tolerable uncertainty depends upon the requirements of the application. While QSPRs for lanthanide series elements have been reported with predictive errors of 2 log units (19), this level of uncertainty may be too high for many applications. The goal of this paper is an uncertainty of ∼1.0 log units KM′. QSPRs were developed for each metal using the calibration data set, following similar procedures for each. First, any ligands with log KM′