NMR Landscapes for Chemical Shift Prediction - The Journal of

Aug 17, 2012 - For families of related molecules containing varied chemical moieties, .... For all species (a) through (f), the scaffold is denoted by...
1 downloads 0 Views 2MB Size
Article pubs.acs.org/JPCA

NMR Landscapes for Chemical Shift Prediction Katharine W. Moore, Richard Li, Istvan Pelczer, and Herschel Rabitz* Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States S Supporting Information *

ABSTRACT: The ability to reliably predict NMR chemical shifts plays an important role in elucidating the structure of organic molecules. Additionally, an intriguing question is how the multitude of variable factors (structural, electronic, and environmental) correlate with the actual electromagnetic shielding effect that determines the chemical shift value. This work presents NMRscape as a new tool for understanding these correlations by constructing the landscape that describes the relationship between the chemical shift value and the moieties bonded to a molecular scaffold. The scaffold may be as small as a single atom probed by NMR or a larger molecular framework containing the probed atom. NMRscape operates with only a list of the chemical moieties bonded to the scaffold, without utilizing any potentially biasing chemometric descriptors. The corresponding chemical shift landscape is constructed based on fundamental physical principles, which makes NMRscape a credible chemical shift prediction and analysis tool. As an illustration, we demonstrate that NMRscape can predict 13C chemical shifts with an accuracy exceeding the substituent chemical shift (SCS) increment, hierarchical organization of spherical environments (HOSE), and neural networks (NN), methods for three distinct families of molecules sharing a common scaffold structure with moieties placed at two variable sites. The constructed NMR landscapes confirmed known empirical rules relating chemical shift values to the variation of chemical moieties on a scaffold, as well as uncovered hitherto hidden relationships. The practical importance of NMRscape is discussed.

1. INTRODUCTION Knowledge of accurate 13C NMR chemical shifts is important for identifying the structure of organic compounds. With this need in mind, a variety of 13C chemical shift prediction methods have been developed,1 including additivity models,2−4 empirical techniques such as hierarchical organization of spherical environments (HOSE)5 and neural networks (NN),6 as well as quantum chemical calculations.7,8 In particular, HOSE and NN are reported to give accurate 13C chemical shift predictions with a mean error of less than 2 ppm when using large training databases, typically containing many thousands of molecules with known chemical shift values.8,9 However, these methods fail to give accurate predictions for some molecules, with maximum reported deviations often being 15−20 ppm or more.8,9 13 C chemical shift prediction methods are generally based on (i) defining a sufficient number of parameters to describe a target carbon atom in order to evaluate its similarity to carbon atoms in the training databases and/or (ii) calculating the effects of electrons or other charges, molecular geometry, etc., © 2012 American Chemical Society

using electronic structure methods. In order to implement these prediction tools, carbon atoms are classified in various ways, including by chemometric descriptors (e.g., hybridization, substitution pattern, etc.),10,11 the number and type of atoms at positions one or more bonds away from the target carbon,9 the choice of density functionals and basis sets for electronic structure calculations,12 or other parametrizations. For families of related molecules containing varied chemical moieties, substituent chemical shift (SCS) increments for the target carbon atom are defined using the attached chemical moieties and their known interactions as parameters.13 Employing any parametric description of a target NMRactive nucleus in a molecule implicitly relies on the assumption that its chemical shift δ can be written as a function F of the parameters {pi} describing the target nucleus, i.e., δ = F(p1, p2,...). The functional relationship F between the chemical shift Received: June 27, 2012 Revised: August 15, 2012 Published: August 17, 2012 9142

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

δ and the parameters {pi} defines an NMR chemical shift landscape. The dimensionality of the landscape is determined by the number of parameters {pi}, or variables14 employed, and methods exist to analyze such landscapes in high dimensions.15 The landscape may be visualized when adequately described by two variables, as shown by the schematic illustration in Figure 1. The NMRscape method introduced here can be adapted to

the topological nature (i.e., local and global maxima and minima) of chemical property landscapes may be assessed generically.16,17 The latter analysis, called OptiChem theory, rests on two key assumptions: (i) the property is physically well-defined and (ii) a sufficient number and type of variables are employed to adequately permit free movement on the landscape F (see refs 16 and 17 for details). Simple physical considerations can often ensure satisfaction of assumption (i); any reasonable physical or chemical property such as catalytic activity, binding efficiency to a protein, or spectral feature satisfies assumption (i). In particular, the NMR chemical shift of an atom is a well-defined property that should satisfy assumption (i). It is difficult to assess whether assumption (ii) is satisfied a priori with a specific set of variables or resources. However, when both assumptions (i) and (ii) are satisfied, mathematical analysis16,17 shows that the property landscape is not only smooth but also has a monotonic topology, (i.e., there are only extrema at the globally maximal and minimal values of F). Saddles may exist at intermediate values of F, but they do not alter the monotonic nature of the landscape. In the schematic of Figure 1, the landscape (a) is smooth, but contains two maxima with different values of F, while the landscape (b) is monotonic. Employing an insufficient number and/or type of variables violates assumption (ii) of OptiChem theory and can artificially alter the landscape topology such that local maxima arise, as in Figure 1a. The topological conclusions about chemical property landscapes arise naturally from either a classical or quantum-mechanical description of an open system interacting with its environment.18 The number and type of variables required to generate a monotonic chemical property landscape is typically unknown a priori and depends on the molecules or materials involved, as well as the property. However, extensive evidence from the chemical literature suggests that, in many cases, only a few properly chosen variables are necessary to satisfy assumption (ii) of OptiChem theory and generate monotonic property landscapes.16,17 In light of this evidence, we will assess the extent to which assumption (ii) holds when using two or three variables for describing 13C NMR chemical shifts in families of molecules containing a common molecular scaffold. Importantly, the underlying principles of OptiChem theory do not require the specification of any particular scaffold; a scaffold may range in size from a single probed atom to a large molecular framework. The choice of scaffold in each illustration here serves to simplify the demonstration and discussion. Successful construction of monotonic landscapes expressing the chemical shift value in terms of two or three variables would indicate that the variables adequately describe the underlying chemical shift landscape. In particular, we investigate the extent to which such monotonic 13C NMR chemical shift landscapes exist and facilitate accurate chemical shift prediction. Of course, in any application, it is possible to employ an inaccurate monotonic landscape that produces poor chemical shift predictability. Importantly, OptiChem theory predicts, upon satisfaction of its assumptions, the existence of a monotonic landscape that captures the underlying relationship between the 13 C NMR chemical shift and the molecules involved, which should enable accurate prediction of chemical shift values. This conjecture will be assessed and applied by investigating the utility of the landscapes for predicting the chemical shifts of unknown compounds via cross-validation. The same principle can be applied when arbitrary numbers of variables are required,15 for example, when more than three sites are

Figure 1. Schematic illustration of NMR chemical shift landscapes, where the shift δ is a function of two chemical moieties given the labels X and Y on distinct sites of a molecular scaffold (see Figure 2 for specific cases). The pixelation evident in the landscape plots is due to the inherent discrete character of chemical moieties; the present plots contain 50 distinct moieties for X and Y. (a) Two distinct maxima (hills) with different values of δ. The global maximum is marked with a yellow star, and the local maximum is marked with a black x. (b) One maximum. Chemical property landscapes, including NMR chemical shifts, have been shown to have the topology with a single maximum depicted in panel b, upon satisfaction of the two assumptions of OptiChem theory.16,17

investigate NMR chemical shift landscapes with arbitrary numbers of variables,15 but the families of molecules examined here are well described by only two or three judiciously chosen variables. A landscape capturing the functional relationship between an NMR chemical shift and a set of variables is a specific example of the more general concept of expressing any chemical property (e.g., catalytic activity, luminescence under optical excitation, binding efficiency to a protein, etc.) as a function of a suitable set of variables describing the molecule or material involved.16,17 In general, any chemical property may be written as a function F(p1, p2,...) of some set of variables {pi} (e.g., appropriate chemometric descriptors, the chemical moiety labels on a scaffold in this article, etc.), thereby forming a chemical property landscape. Recently, we demonstrated that 9143

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

require the use of landscape analysis methods beyond those employed here.15 This work will demonstrate that NMR chemical shift landscapes as a function of the two variables defined on the scaffolds in Figure 2b−d may be used to predict unknown chemical shifts with higher accuracy than widely used empirical methods. We only demonstrate the superior predictive capability of NMRscape for three illustrative molecular families, and additional tests may be performed with other spectral data. We also demonstrate that the chemical shift landscapes revealed by NMRscape can confirm previously established empirical rules relating chemical shift values to molecular structure, as well as uncover otherwise hidden relationships. As the article aims to present the conceptual foundation of NMRscape, the actual development and implementation of a fully automated software implementation is beyond the scope of this work. Section 2 presents the NMRscape method used to generate 13 C chemical shift landscapes in terms of the moieties on a scaffold, as well as discusses how these landscapes are employed to produce accurate chemical shift predictions. Section 3 presents examples of chemical shift landscapes for several classes of compounds and demonstrates the predictive capabilities of NMRscape. Section 4 shows that the landscapes produced by NMRscape can both confirm known and uncover new rules for predicting 13C chemical shifts. Concluding remarks are presented in section 5.

functionalized on a scaffold. Furthermore, the same methods are applicable to 1H, 19F, or any other NMR-active nucleus, but we restrict the examples presented here to 13C because the spectral databases employed are most complete and more reliable for this nucleus.19 This work considers 13C NMR chemical shifts in libraries of molecules whose members share a common scaffold with variable chemical moieties bonded to specific sites. The six scaffolds considered in this article are shown in Figure 2. For

2. NMRSCAPE METHODOLOGY This work presents particular examples of chemical shift landscapes where moieties at two or three independent sites on a scaffold appear to satisfy assumption (ii) of OptiChem theory, as the resulting landscapes are both smooth and monotonic.16,17 We will also demonstrate that the landscapes produce highly accurate chemical shift predictions, as well as provide insight into empirical rules relating 13C shift values to the chemical nature of the moieties on the scaffolds and their complex interactions. Since chemical moieties as variables are inherently discrete objects, it is not obvious that smooth landscapes relating chemical shifts to moieties (e.g., landscapes in Figure 1, accepting finite resolution due to the discrete variables) may be generated at all. Each chemical moiety will contribute both independently and cooperatively with moieties at other sites to influence the shift value in ways that are not always known a priori. As a result, the presentation of a landscape similar to Figure 1 poses an apparent ambiguity that must be addressed. In particular, to generate a smooth landscape, we need to assign an appropriate order to the moieties at each site, i.e., along each axis of the landscape. OptiChem theory predicts the existence of a monotonic landscape (to within finite resolution) upon a suitable ordering of the moieties to define the variable axes but cannot predict whether an ordering that reveals a monotonic landscape is unique; the present results suggest a certain degree of degeneracy of orderings that produce similar landscapes, as discussed below. The principles of NMRscape apply with any number of variables, but the concepts can be appreciated when operating with two variables X and Y, and the methodology below will be explained in these terms. Figure 3 illustrates the importance of identifying an appropriate order of the chemical moieties at each scaffold site to produce a smooth landscape for the case of the 13C chemical shift of a trans-alkene carbon (i.e., scaffold (a) in

Figure 2. Molecular scaffold structures employed in this work. For all species (a) through (f), the scaffold is denoted by the structure within the blue dashed line. Variable moieties bonded to the scaffold are indicated by X, Y, and Z in green. The chemical shift of the carbon atom denoted by a red dot on each scaffold is measured. The zigzag lines in (e) indicate the possible stereochemistry of X and Y; the choice of stereochemistry is included in defining the variable moieties.

each molecule (a) through (f), the scaffold is denoted by the dashed blue line, the target atom for measuring the 13C chemical shift is denoted by the red dot, and the variables are indicated in green as X, Y, and Z. The variables take on the values of specific chemical moieties bonded to the sites; this perspective on the choice and use of variables is explained in section 2. The locations of the variable sites are fixed in each case; for example, the moieties X and Y are fixed in a para substitution pattern on the benzene scaffold in (b). This consideration of chemical moieties on a scaffold is analogous to employing SCS increments for 13C chemical shift prediction.3,20 However, NMRscape can take into account cooperativity between moieties at distinct sites because no additivity relationships are assumed. The definition of a scaffold for NMRscape is somewhat arbitrary in the current discussion; for example, the boundaries of the boxes in Figure 2 could be shifted upon making corresponding changes to the definition of each variable. Similarly, additional variables could be introduced by including variation at another scaffold site (e.g., replacing the hydrogen atom in (f) with an additional variable W) or allowing parts of the scaffold itself to vary (e.g., allowing double bonds in the ring structures of scaffold (e) to produce an additional scaffold variable). Such changes would raise the dimensionality of the resulting chemical shift landscape and 9144

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

Figure 3. Illustration of how the choice of moiety ordering O determines the character of the chemical shift landscape and the resulting chemical shift prediction quality. For a fixed scaffold (here a vinyl group) with variable moieties at sites X and Y, each moiety is arbitrarily given an integer label, as shown in panel a. The experimental chemical shift value is color-coded, as shown by the side bar, with white squares indicating that no spectral data were available. The resulting landscape (a) apparently has no underlying structure. Permuting the rows and columns of landscape (a) using the sorting algorithm, and accordingly shuffling the integer labels, produces the smooth landscape (b), and the reordered integer labels are indicated in violet. The black moiety labels in panel b are the same as those in panel a, but now permuted. The resulting order of the moiety labels shown in panel b is different for the two variable sites X and Y since they are not chemically equivalent. Truth plots of the predicted versus experimental chemical shift values are presented for each landscape, with (b) producing a distinctly superior truth plot.

Figure 2). In this example, there are N = 15 particular moieties considered at each site X and Y, producing N2 = 225 distinct chemical shift values from the N(N + 1)/2 = 120 molecules in the library. Each molecule in the library remains the same when permuting the moieties at X and Y, but the 13C labeled carbon atom has a distinguishable environment upon permutation of the moieties, resulting in an independent chemical shift. The chemical shift data are taken from the KnowItAll spectral database,19 which contains M = 153 distinct chemical shifts for molecules in this library, meaning that N2 − M = 72 chemical shifts were not measured. In Figure 3, these unknown chemical shifts are indicated as white squares, whereas the experimentally observed chemical shifts are denoted by colored squares reflecting their value. To construct a chemical shift landscape as a function of the moieties X and Y, each moiety is initially assigned an arbitrary integer label, e.g., −CH → 1, −CN → 2, etc., as indicated in Figure 3a. The integer labels define a moiety ordering 6 . In the present case, the ascending integer ordering is 6 Xa , 6 Ya = 1,2,...15, where the superscripts denote the sites X and Y, and the subscript a indicates that the ordering produces the landscape in Figure 3a. As expected, this landscape appears to be random and is certainly not smooth or monotonic. To produce a smoother landscape, both the rows and columns of the landscape in Figure 3a may be permuted, i.e., the orderings 6 Xa and 6 Ya may be reassigned. The process can be likened to solving a two-dimensional Rubik’s

cube to obtain the smoothest possible landscape, which may or may not also be monotonic. In the present case, N! = 15! ≈ 1012 unique orderings of the moieties are possible at each site, so an automated search procedure must be employed, as will be explained below. Figure 3b shows that an appropriate choice of orderings 6 Xb and 6 Yb produces a smooth and monotonic landscape. Importantly, the same data is present in Figure 3a,b; only the rows and columns of the data have been shifted in going from (a) to (b), as shown by the black and violet integer labels, respectively, on the landscape in (b). The moiety orderings 6 Xb and 6 Yb in Figure 3b are different because the two sites X and Y are not chemically equivalent. The quality of a landscape upon any particular choice of moiety ordering 6 i labeled by index i may be quantified in various ways.21−23 One method is based upon fitting the experimental landscape data to a flexible analytical function, where a sophisticated fitting procedure may make use of many different analytical functions, including radial basis functions, Fourier series, etc. For simplicity, this work employs a leastsquares fitting of the M available experimental chemical shifts to a polynomial function P(x,y). As high-order polynomial functions are smooth but not necessarily monotonic, this choice of basis provides a suitable means to assess the existence of a monotonic landscape. To define the polynomial axes x and y, an ordering 6 Xi is mapped to x = 1, 2, ..., NX, where NX denotes the number of moieties at site X. Similarly, the 9145

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

ordering 6 Yi is mapped to y = 1, 2, ..., NY. An example of this mapping is shown in going from the black integer labels to the violet labels on the landscape in Figure 3b. The highest order polynomial terms employed in the present work were chosen to ensure sufficiently accurate fitting, such that the fit quality (c.f., eq 1) improved less than 0.02 ppm upon adding one additional polynomial order to avoid overfitting. The resulting polynomial fitting functions are between eighth and 12th order, depending on the molecular scaffold. To account for cooperativity between substituent sites, cross-terms (i.e., dependent on both x and y) to all orders are included in the polynomials. The resultant least-squares fitted surface of the experimental data generates a predicted (fitted) chemical shift value for each of the M experimental chemical shifts, as well as predicted values for chemical shifts of compounds not present in the database (i.e., the white squares in Figure 3), which makes NMRscape an effective chemical shift prediction tool, as will be discussed further below. The fitted landscape may be evaluated in various ways to assess the quality of the ith ordering 6 i, including from metrics based upon the norm of the landscape gradient.21 Here, we simply employ the accuracy of the least-squares fit (i.e., the agreement between the experimental shift values and those predicted by the polynomial fitting function) to assess the quality of 6 i. Let each experimental chemical shift be indexed by m, m = 1, ..., M. The deviation between each predicted chemical shift value δPm and the corresponding experimental value δEm is easily visualized by a truth plot pitting these quantities against each other. Small deviations from the δPm = δEm line for all m indicate a high quality landscape fit. Truth plots obtained upon fitting the landscapes in Figure 3 to an eighthorder polynomial are shown to the right of the respective landscapes, which demonstrates the superiority of landscape (b). To assess the truth plots quantitatively, we employ the mean absolute error (MAE) between the experimental chemical shifts δEm and their predicted values δPm MAE =

1 M

The permutations that yield the smoothest landscape (e.g., the transition from Figure 3a to b) with the best MAE value may, in principle, be discovered by inspection in simple cases. However, since the number of possible moiety orderings grows factorially with the number of moieties on each site, manually finding the ordering that produces the smoothest landscape is generally infeasible. While chemical intuition may, in some cases, yield a reasonable ordering of the moieties to produce a relatively smooth landscape,24 under most circumstances, there is no way to a priori know the best moiety ordering. Thus, it is necessary to employ automated moiety reordering procedures to obtain smooth landscapes. Importantly, when assumption (ii) of OptiChem theory is satisfied, that is, the chosen chemical moiety variables produce an adequate set, then a proper choice of the ordering should produce a landscape that is also monotonic,16,17 as exemplified by the landscape in Figure 3b. A variety of reordering methods have been employed to uncover smooth and monotonic landscapes relating moieties on a scaffold to chemical properties including glass transition temperature,21 photoluminescent emission energy,22 and binding efficiency to a protein.23 All of these reordering algorithms account for cooperative effects between the moieties at different sites that may influence the property value. Additivity, or independence of the substituent sites, is commonly assumed in NMR prediction methods,2−4 but this assumption is often inadequate to fully describe NMR shift behavior13,25,26 because of cooperative influences between the moieties upon the chemical shift value. Thus, the two reordering algorithms employed in this work account for cooperativity. Both algorithms will be illustrated for two independent sites X and Y, but can be generalized to arbitrarily high dimensions, where special interpolation methods can be used for quantifying the landscape smoothness.15 The first reordering procedure utilized in this article is a sorting algorithm23 that orders the moieties from those producing the lowest to highest average chemical shift values. For each distinct chemical moiety x = 1, 2, ..., NX on site X, the average chemical shift deviation μx is obtained from summing both over all chemical moieties x′ ≠ x and over all chemical moieties on site Y (y = 1, 2, ..., NY):

M

∑ |δmP − δmE| m=1

(1)

Another complementary metric to gauge the prediction quality is the maximal deviation between the predicted and experimental chemical shifts dmax = max|δmP − δmE| ≥ MAE m

μx =

1 nx

NX

NY

∑ ∑ (δxy − δx ′ y), x ′≠ x

y

∀ δxy , δx ′ y (3)

where nx is the number of experimental chemical shift values in the summation, and the latter condition (i.e., for all observed δxy, δx′y) stipulates that the summation is only carried out over compounds for which chemical shift data are available. The analogous formula to eq 3 for site Y generates the average deviations μy for each moiety y = 1, ..., NY. The values μx and μy are then independently sorted from lowest to highest to generate the orderings 6 X and 6 Y. The sorting technique can easily be extended to any number of substituent sites. While the independent sorting of each substituent site inherently assumes some degree of additivity, the averaging over all of the moieties from the other site(s) implicitly takes into account cooperativity between the moieties on each site. The sorting method is computationally inexpensive but does not always generate the smoothest possible landscape. Thus, this work also performs combinatorial optimization with a genetic algorithm (GA)27 to minimize the MAE metric in eq 1 directly, which in some cases can produce landscapes with significantly better MAE values than attained by the sorting

(2)

Both metrics are typically employed to assess the quality of chemical shift predictions with empirical methods.8,9 The latter works also employ the standard deviation between the δPm and δEm, but only the MAE and dmax are considered here for simplicity and because the standard deviation was found to exhibit the same qualitative behavior as MAE. Both metrics are calculated using the same polynomial fitting function to generate the predicted shift values. The superiority of the landscape in Figure 3b is apparent from its improved values of MAE = 1.36 ppm and dmax = 5.69 ppm when fitted to an eighthorder polynomial function as compared to the arbitrarily assigned initial ascending integer label ordering in Figure 3a, which produces the high values MAE = 7.19 ppm and dmax = 25.1 ppm. The MAE will be used both to evaluate the quality of the landscape produced by a specific moiety ordering and the ability of NMRscape to predict unknown 13C chemical shifts. 9146

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

marginally improved upon subsequent use of the GA), this procedure is followed. In one case discussed in section 3.3, the sorting algorithm produces a distinctly inferior landscape, so the GA is utilized for prediction. Because of the significant computational expense of running the GA for each of the M − 1 restricted data sets, the optimal moiety ordering obtained from the GA using the full database of M shifts is employed. Polynomial surfaces with this ordering are constructed for each incomplete data set of size M − 1 to predict the missing chemical shift. While this process is to some extent biased because the optimal ordering was found using the chemical shift that should be missing, removing one chemical shift from the data set and running the GA was found to produce similar or identical optimal orderings in several runs; so, it is expected that the current procedure does not confer a significant biased advantage. To accomplish assessment (2), reduced training sets containing k experimental chemical shifts are constructed and the remaining M − k unknown chemical shifts are predicted using a polynomial surface fitted to the landscape with the optimal ordering obtained with k shifts. The number of unique training sets of size k is factorially large, and each training set may produce distinct optimal orderings that yield different predictions for the M − k unknown shifts. For cases where the sorting algorithm produces a good-quality landscape, 1000 unique training sets of each size k are evaluated in order to identify the statistically expected MAE value when employing a (random) training set of size k. For the case described in section 3.3 where the GA is employed to identify the optimal ordering, only one reduced training set of each size k is constructed. Nearly all of the experimental chemical shifts in this work are collected from the KnowItAll Informatics database.19 Spectra taken in CDCl3 as the solvent are used where available; otherwise, spectra taken in DMSO or CDCl3/DMSO are employed (approximately 10% of all shifts). Where multiple spectra measured in CDCl3 were available for the same compound, one spectrum is chosen arbitrarily. In the latter situation, since distinct spectra usually agree to within 1 ppm, different choices of spectra were not found to significantly affect the optimization procedure or the prediction quality. Spectra for 18 additional compounds were measured in this work using samples purchased from Sigma-Aldrich [St. Louis, MO]. The samples were used without further purification, and the NMR spectra were taken in CDCl3, with the exception of one sample discussed later. The latter compounds measured in this work, along with their relevant chemical shifts in CDCl3 are provided in Table 2, as well as Tables S1 and S2 of the Supporting Information. In order to compare NMRscape’s predictive capability with existing empirical methods, assessment (1) above is also performed using SCS increment, HOSE, and NN methods. SCS increments are taken from tables3,20 and the appropriate additivity equations and corrective terms3,26 are evaluated to predict chemical shifts. The HOSE and NN methods incorporated into the MODGRAPH software package,29 along with its associated spectral databases, are employed without alteration. MODGRAPH and KnowItAll share some, but not all, of the same spectral databases.19,29 When a target chemical shift exists in the spectral database used for HOSE, the method selects the experimental chemical shift value without making a prediction. Thus, HOSE predictions in this work are made only for compounds that are not present in the

method. The GA seeks an ordering that minimizes MAE by mimicking biological evolution. The ith ordering 6 iX, 6 iY specifies a genome subject to optimization. The landscape for this genome is fit to a polynomial with all cross-terms, as described earlier, and this polynomial is used to evaluate MAE(OXi ,OYi ). Genomes producing lower MAE values are fitted from an evolutionary standpoint, and the algorithm gives them a greater chance to survive and reproduce. The present algorithm implementation begins with an initial population of i = 1, 2, ..., 1000 randomly chosen genomes, whose MAE values are evaluated. The probability of a genome being allowed to reproduce, i.e., to form the genomes of the next generation, is inversely proportional to its MAE value. The 10 best members of the population are passed unchanged into the next generation, and 990 new genomes are generated through crossover and mutation operations acting on the probabilistically chosen parent genomes.27 The process is repeated iteratively until the MAE value of the best genome remains constant for at least 100 generations, and typically ∼2000− 3000 generations are needed to achieve convergence, which typically takes several hours of run time on a desktop computer, and only minutes on a suitable computer cluster. In some cases, the initial population of the GA is seeded with the ordering obtained from the sorting algorithm. This process decreases the number of generations required for convergence by a factor of ∼10 and tends to produce an optimal ordering similar to that obtained from sorting, but with the landscape exhibiting a slightly improved MAE value. This result indicates that related moiety orderings can produce similar monotonic landscapes. Importantly, the run time of the various chemical shift prediction methods is generally not of practical concern, as taking NMR data is the most time-consuming step. The latter effort would be particularly significant in the case of a new family of compounds being considered, in which case NMRscape can make experimental measurement more efficient by guiding the sampling of compounds based on the desired prediction accuracy over the full landscape. Further discussion of this matter will be given below. Chemical shift predictions using NMRscape are generated upon identification of an optimal moiety ordering and the associated landscape, as described above. Two distinct crossvalidation assessments of the prediction quality are examined in this work: (1) predict each chemical shift in the available database while employing all other available data to deduce the moiety orderings (i.e., leave out one cross-validation28) and (2) predict a set of unknown chemical shifts using small training databases (i.e., random subsampling cross-validation28). Assessment (1) is commonly employed to evaluate empirical methods, where large chemical shift databases are utilized to predict the chemical shift of additional compounds.8,9 Assessment (2) aims to identify the limits of prediction capability for NMRscape and is rarely considered in testing existing empirical methods, as restricting the database size is generally detrimental to accurate prediction. To accomplish assessment (1), the full data set containing M chemical shifts has member m removed, and the unknown mth shift is predicted from the remaining data set of M − 1 shifts. The predicted value δPm for this shift is generated from the leastsquares fitted surface of the optimally ordered data set of M − 1 shifts. Upon evaluating δPm for each shift, the MAE over all M predicted shifts is computed. For cases where the sorting algorithm produces a good-quality landscape (i.e., the MAE produced by the ordering obtained from sorting is only 9147

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

Table 1. Optimal Order of Moieties for para-Disubstituted Benzene Compounds at ipso (Y) and para (X) Positions of the Scaffold (b) in Figure 2 order

ipso moiety

para moiety

order

ipso moiety

para moiety

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

I CN Br H COOCH3 COOH CH2CN COOCH2CH3 CF3 COCl CONH2 CH2COOH Cl CON(Me)2 CHO acetyl cinnamoyl CO-phenyl COCH2CH3 CH2Cl CH3 SCH3 NHAc SO2CH3 phenyl

N(Et)2 N(Me)2 NH2 NHCH3 NH-phenyl OH OCH3 OCH2CH3 O-phenyl NHAc F OCON(Me)2 OCONHCH3 t-butyl OAc CH3 SCH3 cyclopropyl i-propyl benzyl CH2CH3 n-propyl Cl CH2COOH Br

26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

CH2OH benzyl n-propyl SO2NH2 SO2Cl NH-phenyl CH2CH3 cyclopropyl NO2 NH2 N(Et)2 i-propyl NHCH3 N(Me)2 OAc OCONHCH3 OCON(Me)2 t-butyl azophenyl OH O-phenyl OCH2CH3 OCH3 F NO

phenyl CH2OH CH2CN I H cinnamoyl CH2Cl CON(Me)2 azophenyl CONH2 SO2NH2 CF3 CO-phenyl COOCH2CH3 CN COOH COOCH3 acetyl COCH2CH3 SO2CH3 CHO SO2Cl NO2 COCl NO

3. RESULTS The molecular scaffolds considered in this work are shown in Figure 2. All have two variable substituent sites X and Y, with the exception of the trisubstituted methyl carbon atom in (f), with three sites X, Y, and Z. The set of chemical moieties and resulting number of moieties N on each site are chosen such that at least ∼40% of all possible chemical shifts in each molecular family have experimental measurements in the KnowItAll database. Below, the optimally reordered landscapes are presented for each scaffold (b) through (f) in Figure 2 (the landscape for the vinyl scaffold (a) in Figure 2 is shown in Figure 3b). The optimal ordering of moieties is presented for each molecular family as well. For the structures (b) through (d) in Figure 2, the predictive capability of NMRscape is compared to existing empirical methods. Predictions were not made for scaffold (e) because an insufficient amount of data was available to warrant comparison to existing prediction methods. The scaffold (f) with three sites is treated in a different fashion described later. 3.1. 13C Shift of ipso-Carbon on para-Disubstituted Benzene Compounds. For the para-disubstituted benzene scaffold in Figure 2b, N = 50 moieties on each site generate the shift landscape. Because the sites X and Y are not chemically equivalent with respect to the red carbon atom in Figure 2b, N2 = 2500 distinct chemical shifts can be generated from this library of N(N + 1)/2 = 1275 molecules, of which 1215 shifts are available in the KnowItAll database. The optimal moiety order was identified by first using the sorting method in eq 3, followed by GA optimization of eq 1, where one of the 1000 initial orderings for the GA was taken to be the ordering obtained by sorting. This two-step process produced the optimal moiety orderings for the ipso site Y and para site X given in Table 1 and the resultant landscape in Figure 4a. The

MODGRAPH database, but are present in the KnowItAll database. NN makes predictions based on the collective behavior of all compounds in the database, instead of selecting the experimental chemical shift value for compounds in the MODGRAPH database. Thus, NN predictions in this work are made for all molecules, including those in the MODGRAPH database. Assessment (2) is also examined with HOSE in section 3.1, where reduced data sets were constructed in the NMR-DB software package incorporated into MODGRAPH.29 We demonstrate that NMRscape predicts chemical shifts more accurately than either HOSE or NN for three molecular families having extensive available spectral data. Other families of molecules could be assessed as well upon acquisition of sufficient spectral data and appropriate definition of the variables. Although we will demonstrate the superior accuracy of NMRscape predictions, the issue of computational expense for any prediction method is important. Leaving aside the significant time cost of measuring experimental chemical shifts, the most significant computational expense for NMRscape is running the GA (several hours or minutes, depending on the computer), after which the chemical shifts of any number of unknown compounds may be predicted simultaneously upon polynomial fitting, which requires negligible computational time. In the MODGRAPH software, performing a chemical shift prediction with both HOSE and NN of each unknown compound takes ∼5−30 s on a desktop computer, so the total effort is summed over the set of unknown compounds whose chemical shifts are predicted. In practice, experimental measurements and predictions may be performed in iterative cycles, where the GA for NMRscape would not necessarily be needed at each iterative step. 9148

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

the sorting method.23 Optimization with the GA using random initial orderings was found to produce a similar landscape with the maximum in the upper right-hand corner, indicating in this case that the sorting method reliably captures the underlying relationship between 13C chemical shift and the moieties. The increase in chemical shift value is asymmetric with respect to the substituent position: the ipso moiety Y affects the shift more than the para moiety X, as may be expected because the close proximity of the ipso moiety makes it exert a more direct electronic influence on the measured 13C chemical shift. Eight outliers (circled) are apparent on the landscape in Figure 4a. Three of these compounds were subsequently purchased from Sigma-Aldrich and their spectra measured in CDCl3 for control and comparison. In all cases, our experimental shifts are closer to the values predicted by NMRscape, as shown in Table 2. An interesting compound has iodine at the ipso position and SO2Cl at the para position. 13C NMR spectra for this compound were measured in this work in pure CDCl3, as well as a 1:1 mixture of CDCl3 and DMSO. The 13 C shift undergoes a sudden transition from 103.8 ppm in pure CDCl3 to 95 ppm for a 1:1 solvent mixture of DMSO:CDCl3 (note that the latter value is similar to the 96.5 ppm recorded in the KnowItAll database). This is an example of a particularly strong solvent effect for 13C NMR. Upon correction of the three measured outlier shifts and removal of the remaining outliers, the landscape is shown in Figure 4b. This landscape has the improved MAE value of 1.04 ppm upon fitting with a ninth order polynomial, as compared to 1.18 ppm when the outliers are not removed. Assessment (1) of NMRscape from section 2 is compared with the SCS, HOSE, and NN methods. Because the sorting algorithm captures the landscape relationship between the chemical shift and the moieties, it is used to guide the generation of fitted landscapes from the data, as described in section 2. Since the two sites X and Y are not chemically equivalent, each compound where X ≠ Y contributes two distinct chemical shifts; thus, only one of these was removed for prediction. The truth plots for each prediction method are shown in Figure 5: SCS (a), HOSE (b), NN (c), and NMRscape (d). The SCS method produces many outliers and significant deviation at low chemical shift values. Most of the compounds are present in the MODGRAPH database used for HOSE, so the HOSE method does not actually predict their chemical shifts. Thus, only the predicted chemical shifts for the 127 compounds not present in the MODGRAPH database are shown in Figure 5b. The NN method shows very poor

Figure 4. (a) Chemical shift landscape for para-disubstituted benzene compounds with outlier shift values circled. (b) Landscape determined upon removal of outliers. The optimal moiety order was found through the two-step process of sorting followed by GA optimization described in section 2. The maximum is in the upper right-hand corner of the landscape, which is typical of all sorting-based landscapes (see also Figure 3c). The optimal orderings are given in Table 1.

landscape has an overall monotonic increase in chemical shift upon moving to the top and toward the right. This type of landscape structure always arises for optimal orderings based on

Table 2. para-Disubstituted Benzene Compounds Producing Outlier 13C Chemical Shiftsa ipso

para

predicted

database

solvent

exptl (CDCl3)

CH2OH I N(CH3)2 I cyclopropyl I NO NO

COOH SO2Cl COCl NNPh OCH3 NO F NO

146.54 101.92 155.12 97.09 139.82 120.35 157.99 161.1

149.63 96.5 164.8 88.23 148.07 147.1 143.8 150.82

DMSO CDCl3/DMSO CDCl3 neat CD2Cl2 CDCl3 CDCl3 CDCl3

147.00 103.8b 151.98

The NMR spectra of the first three compounds listed were taken in CDCl3 as a part of this work (labeled “exptl (CDCl3)”). The discrepancy between the NMRscape predicted and database values of the first and second compounds arises from the choice of solvent, but the cause of the discrepancy for the third compound is unknown. The remaining compounds were unavailable for purchase, and the discrepancies are assumed to arise from solvent effects or data errors. bThe measured chemical shift in a 1:1 mixture of CDCl3 and DMSO was 95 ppm. a

9149

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

Figure 5. Truth plots for shift prediction for para-disubstituted benzene compounds with (a) SCS increments, (b) HOSE, (c) NN, and (d) NMRscape. The plot for HOSE in panel b contains only chemical shifts from the 127 compounds not contained in the MODGRAPH database. In panel c, for NN, the shifts predicted at 92 ppm with widely varying experimental values (indicated by the arrow) are for compounds with iodine as the ipso moiety. Large deviations are also observed for SCS and HOSE in panels a and b. In contrast, there are no significant outliers in the NMRscape plot (d).

prediction quality when iodine is the ipso moiety (as shown by the vertically aligned points in Figure 5c at 92 ppm). The NMRscape method clearly yields the best truth plot. Both the MAE and maximal deviation dmax over the set of shifts characterize the prediction quality; Table 3 shows that NMRscape outperforms all three other methods.

Assessment (2) from section 2 was evaluated for NMRscape and HOSE. Using the NMR-DB software in MODGRAPH,29 two training databases were randomly chosen containing k = 1000 and k = 700 shifts; the remaining 1215 − k shifts were predicted with HOSE. The same two training sets were employed for prediction with NMRscape as well, with the resulting truth plots for both methods provided in Figure S1 of the Supporting Information. The NMRscape results are clearly superior to those from HOSE, as shown by the statistics for these predictions in Table 3. Assessment (2) for NMRscape was also examined for training sets containing from k = 200 out to k = 1150 chemical shifts, where 1000 unique training sets were constructed for each k, as described in section 2. The results as a function of k are shown in Figure 6, which presents the statistically expected MAE over the 1000 training sets for each k along with error bars, as well as the MAE of the best training set. For reference, the HOSE prediction MAE values are shown as a function of k as well, where the point at k = 1215 is from the HOSE prediction using all available data (i.e., the second row in Table 3). Even for k = 300 (12% of the possible shifts on the landscape and 25% of available experimental data), the NMRscape method can produce MAE values comparable to or better than the HOSE and NN methods using full databases, as can be seen by comparing the MAE values from NMRscape and HOSE in Figure 6 and Table 3 3.2. 13C Shift of Methylene Compounds CH2XY. The next molecular family examined consists of CH2XY compounds with X and Y as variable moieties. As the scaffold in this case is a single methylene group, most of the moieties (c.f., Table 4)

Table 3. Prediction Statistics for SCS, HOSE, NN, and NMRscape Methods for the 13C Chemical Shifts of paraDisubstituted Benzene Compoundsa method

shifts in database

MAE

dmax

SCS HOSE NN NMRscape HOSE NMRscape HOSE NMRscape

946 ∼150000b ∼100000b 1214 1000 1000 700 700

1.91 2.60c 1.70 1.02 2.61 1.14 2.74 1.24

22.7 13.3 18.0 6.38 11.59 3.66 17.54 7.82

a

Statistics are shown both for all available data and with reduced training data sets for the HOSE and NMRscape methods. The low dmax for NMRscape with 1000 shifts in the training database arises because molecules exhibiting larger deviations upon prediction with the full data set were included in the training database, and thus, their chemical shifts were not predicted. bThe HOSE and NN databases contained additional molecules beyond those of the scaffold structure. c The MAE value for HOSE is taken over the 127 compounds not present in the MODGRAPH database. 9150

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

Figure 7. Chemical shift landscape for the methylene scaffold in Figure 2c (CH2XY compounds). The optimal moiety order was identified through the two-step process of sorting followed by GA optimization (c.f., section 2). The landscape is symmetric due to the chemical equivalence of the two variable sites, with the maximum in the upper right corner. The optimal ordering is given in Table 4.

Figure 6. Prediction error as a function of training set size for NMRscape with para-disubstituted benzene compounds. The expected MAE values reflect the mean and standard deviation over the sample of 1000 training sets of each size k. The best MAE value is from the training set of each size k producing the lowest MAE value. For comparison, the prediction error for HOSE training sets is shown by the squares.

values rise monotonically towards the upper right corner of the landscape, as was the case for para-disubstituted benzene compounds. Use of the GA with random initial orderings was found to produce a similar landscape structure, indicating that sorting captures the underlying landscape relationship between the chemical shift and the moieties. The predictive capability of NMRscape using the sorting algorithm was tested against the SCS, HOSE, and NN methods. The four truth plots are given in Figure 8a−d for SCS, HOSE, NN, and NMRscape, respectively. The SCS truth plot shows a systematic overestimation at low and high chemical shifts when heteroatoms are involved, even though heteroatom corrective terms are applied.26 For HOSE, the majority of the shifts are

are significantly larger in size than the scaffold. The results from this family of molecules demonstrate that NMRscape can function well with a small scaffold, where moiety steric effects may be significant. N = 65 moieties are chosen, for a total N(N + 1)/2 = 2145 unique chemical shifts in the library because the two substituent sites are equivalent. Of these, 1038 shifts are available in the KnowItAll database and used for landscape construction. The ordering that produces the smoothest landscape, presented in Table 4, was determined with sorting followed by GA optimization. The resulting landscape, which is symmetric due to the chemical equivalence of the two substituent sites, is shown in Figure 7. The chemical shift

Table 4. Optimal Ordering for the Moieties of Methylene Compounds in Figure 2c order

moiety

order

moiety

order

moiety

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 6 17 18 19 20 21 22

I Sn(Me)3 H CN CC Si(Me)3 CH2CHO SH CH3 CH2Ac CH2COOH CH2CN Br 2-furyl i-butyl allyl CH2OCH3 n-propyl n-butyl o-hydroxyphenyl amino acid benzyl

23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44

CH2CH3 oxirane CH2OH CH2Cl CH2Br CH2NH2 CC COOH COOCH3 COOCH2CH3 p-hydroxyphenyl CONH2 p-nitrophenyl phenyl phenyl acetal NHCOPh cyclohexyl COPh 2-pyridyl NHAc i-propyl CH(CH3)Cl

45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65

NH2 Cl NHPh Ac C(Me)2OH t-butyl NHCH2CH3 NHCH3 N(Et)2 SO2CH3 SO2Ph N(Me)2 OH OCHO OAc OPh O-vinyl OCH2CH3 OCH3 NO2 F

9151

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

Figure 8. Truth plots for 13C shift prediction for the methylene scaffold (c) in Figure 2 with (a) SCS increments, (b) HOSE, (c) NN, and (d) NMRscape. The SCS plot in panel a produces a systematic underestimation for most of the chemical shifts. Both HOSE and NN in panels b and c exhibit multiple outliers. Only the 194 compounds not present in the MODGRAPH database are included on the HOSE plot (b). The only significant outlier for the NMRscape plot (d) is the compound with X = Y = iodine, which is also an outlier on the other plots as well.

represented in the databases available in MODGRAPH, thereby obviating the ability to make predictions for those cases. Only the predictions for the 194 compounds not present in the HOSE database are included in the statistical analysis. NMRscape outperforms all three empirical methods, as shown in Table 5 and Figure 8d. Table 5. Prediction Statistics for SCS, HOSE, NN, and NMRscape Methods for the 13C Chemical Shifts of Methylene Compounds method

shifts in database

MAE

dmax

SCS HOSE NN NMRscape

774 ∼150000a ∼100000a 1037

2.64 3.59b 1.86 1.42

22.3 18.87 25.91 16.07

a

The HOSE and NN databases contained additional molecules beyond those of the scaffold structure. bThe MAE value for HOSE is taken over the 194 compounds not present in the MODGRAPH database.

Figure 9. Prediction error as a function of training set size for NMRscape with methylene compounds. The expected MAE values reflect the mean and standard deviation over the sample of 1000 training sets of each size. The best MAE value is from the training set of each size k producing the lowest MAE value.

Assessment (2) was evaluated for NMRscape utilizing the same methods as for the para-disubstituted benzene scaffold. The expected and best MAE values as a function of training set size k, from k = 210 through k = 970, are shown in Figure 9 with the error bars generated from the 1000 training sets for each k. NMRscape data sets with as few as k = 520 shifts (24% of all possible shifts) produced better expected MAE values than SCS, HOSE, and NN using all databases. Even the best chosen training data set containing only k = 260 shifts (12% of all possible shifts) produces a MAE of approximately 2 ppm. The efficacy of NMRscape predictions for such a small scaffold using very sparse data sets demonstrates the potential broad utility of the method.

3.3. 13C Shift of Carbonyl Compounds. The carbonyl scaffold in Figure 2(d) is also small in size compared to most of the chemical moieties on the scaffold (c.f., Table 6). For N = 60 moieties and chemical equivalence of the two variable sites, there are N(N + 1)/2 = 1830 unique chemical shifts from the same number of molecules in this family. Of these, experimental data were available for 806 compounds, including data from the KnowItAll database19 and the chemical shifts of 12 compounds measured in this work, as listed in Table S1 of the Supporting Information. Unlike the para-disubstituted 9152

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

Table 6. Optimal Ordering for Carbonyl Compounds in Figure 2d order

moiety

order

moiety

order

moiety

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Cl O-t-butyl O-i-propyl O-n-butyl OCH3 O-ethyl O-n-hexyl O-phenyl NH-phenyl O-vinyl NH-p-chlorophenyl O-allyl O-propyl NHCH3 N(CH3)2 NH2 OH O− 2-furyl trifluoromethyl

21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

trichloromethyl 2-thienyl dichloromethyl vinyl-2-furyl cinnamoyl chloromethyl 1-naphthtyl benzyl n-hexyl n-pentyl n-butyl cyclohexyl t-butyl i-propyl sec-butyl cyclopropyl ethyl n-propyl i-butyl methyl

41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

o-hydroxyphenyl o-toluyl 1-methylvinyl p-hydroxyphenyl phenyl p-toluyl p-aminophenyl p-fluorophenyl p-chlorophenyl p-methoxyphenyl vinyl 3-pyridyl 4-pyridyl 2-pyridyl H 3-furyl CH2CN imidazolyl 2-pyrrolyl N-methyl 2-pyrrolyl

benzene and methylene scaffolds, the landscape relationship between the carbonyl chemical shift and the moieties is not effectively captured by the sorting algorithm. This result is likely due to cooperativity effects between the two moieties playing a larger role in carbonyl compounds, as the sorting algorithm only takes cooperativity into account implicitly, while the cross-terms in the polynomial fitting functions employed in GA optimization explicitly capture cooperativity effects. Upon sorting followed by GA optimization, the landscape has its maximum in the upper right corner, as shown in Figure S2 of the Supporting Information, with an MAE of 1.49 ppm upon fitting to a 12th order polynomial. The landscape resulting from GA optimization with initial random orderings, in contrast, has the maximum near the center of the variable space, as shown in Figure 10, as well as a significantly better MAE value of 1.38 ppm upon fitting to a 12th order polynomial. The optimal moiety order found with the GA is presented in Table 6. Assessment (1) for NMRscape was compared with the HOSE and NN methods (SCS increments were not available for the majority of the compounds). Because the optimal ordering must be found with the GA, the ordering obtained with the full data set of 806 chemical shifts was employed to generate the fitted surfaces for prediction of each chemical shift, as described in section 2. Truth plots for HOSE, NN, and NMRscape predictions are shown in Figure 11a−c, respectively. As was the case in the two previous molecular families in sections 3.1 and3.2, the chemical shifts of many compounds were present in the MODGRAPH database, so only the 124 compounds not available in the latter database are considered here for HOSE. Visual examination of the truth plots in Figure 11 shows that NMRscape contains no significant outlier points, while both HOSE and NN have 12 and 9 outliers with deviations exceeding 8 ppm, respectively. The MAE and maximal deviation dmax for each truth plot are presented in Table 7; NMRscape significantly outperforms both HOSE and NN. Assessment (2) was tested for NMRscape with training sets containing k = 250 through k = 620 of the available 806 chemical shifts, where one training set of each size k was

Figure 10. Chemical shift landscape for carbonyl compounds (scaffold (d) in Figure 2) using the order of moieties deduced from the GA initialized with randomly generated orderings. The global maximum lies near the center of the landscape, unlike the landscapes in Figure 4 and Figure 7.

generated by random removal of the appropriate number of compounds from the full data set. As described in section 2, the GA was used to find the optimal moiety order for each training set. The MAE values obtained upon predicting the 806 − k chemical shifts as a function of k are shown in Figure 11d. The prediction quality gradually decreases with smaller training set sizes, but even with k = 250 shifts (13% of the 1830 total shifts), the MAE is superior to both HOSE and NN with their exploiting access to all experimental data (c.f., Table 7). This result further indicates that very sparse sampling of spectral data can generate accurate predictions with NMRscape, which provides a significant potential advantage over existing empirical methods. 3.4. Further Applications. Both NMR landscape construction and chemical shift prediction with NMRscape illustrated above may be generalized in many ways, two of 9153

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

Figure 11. Truth plots for predicting 13C chemical shifts of carbonyl carbons: (a) HOSE, where the predicted and experimental chemical shifts are plotted only for the 124 compounds that were not available in the MODGRAPH database. (b) NN, where all predicted and experimental chemical shifts are plotted, even those for which the compounds were available in the MODGRAPH database. (c) NMRscape. The plots a and b have many outliers with deviations of greater than 8 ppm, while the NMRscape in plot c clearly produces more accurate predictions. (d) MAE for NMRscape prediction as a function of training set size, which shows that very accurate predictions with MAE < 2 ppm can even be made with 250 ≈ 13% of all 1830 possible chemical shifts in the database.

moieties demonstrates an extension of NMRscape in terms of the definition of variables. Three compounds listed in Table S2 of the Supporting Information (i.e., out of the total of 24 used in this assessment) were purchased from Sigma-Aldrich and their spectra measured in CDCl3 as a part of this work. Given the small size of the library of molecules, it was possible to order the moieties by inspection, and the resulting landscape in Figure 12 appears monotonic. For scaffolds with more than two functionalized sites, special reordering and landscape representation methods will typically be required. Cooperativity effects may be taken into account explicitly using the high-dimensional model representation (HDMR) method, which can identify the significant

Table 7. Prediction Statistics of the HOSE, NN, and NMRscape Methods for 13C Chemical Shifts of Carbonyl Compounds method

# of shifts in database

MAE

dmax

HOSE NN NMRscape

∼150000a ∼100000a 805

3.83b 1.96 1.21

17.67 19.19 7.72

a

The HOSE and NN databases contained additional molecules beyond those of the scaffold structure. bThe MAE value for HOSE is taken over the 124 compounds not present in the MODGRAPH database.

which are addressed here. First, the landscape for a complex cholesteryl scaffold is presented. Despite the complexity of the scaffold, the chemical shift of the target methyl carbon (c.f., Figure 2e) is shown to generate a monotonic landscape. Second, any number of scaffold sites may be employed with NMRscape, with the dimensionality of the resulting landscape growing accordingly;15 a simple three-dimensional example is examined here. Spectral data in the KnowItAll database were collected for the cholesteryl scaffold with variable sites at positions X and Y (c.f., Figure 2e), where the chemical shift of the carbon on the methyl group indicated in red is targeted. Eight moieties at site X and five moieties at site Y were available, with stereochemistry defining the moiety in addition to its chemical identity (e.g., Z-CH3 is a separate moiety from E-CH3, where stereochemistry is denoted with respect to the target methyl group). The use of stereochemical factors in describing distinct

Figure 12. Chemical shift landscape for the cholesteryl scaffold in Figure 2e. Moieties are labeled on the axes, including stereochemical character. The landscape was ordered by inspection. 9154

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

NMRscape rests on the fundamental principles of physical observable landscape theory,18 embodied by OptiChem.16,17 Upon satisfaction of the assumptions of OptiChem theory, the 13 C chemical shift landscape should be monotonic with proper ordering of the substituent moieties. The illustrations in this article confirm this behavior, implying satisfaction of the key assumptions, thereby opening up the prospect of extracting physical information from the optimal moiety orderings. Below, we demonstrate that the landscapes revealed in this work uncover both existing and new rules relating NMR shifts to the chemical properties of the moieties. The relationship between chemical shift values and the moieties on a scaffold has been widely investigated. SCS increments3 have been determined for many species, including the classes of compounds examined here.13,30−32 A related approach is to correlate chemical shift values to chemometric descriptors such as electronegativity,33,34 π-electron density,35 and Hammett constants.36,37 NMRscape is distinct from these studies in that (i) the chemical features of the moieties are not explicitly used for determination of their functional relationship to the shift value and that (ii) the presence of possibly complex cooperativity between moieties is taken into account, guided by the principles of OptiChem theory with reordering of the moieties and multivariate polynomial fitting of the landscape. The optimal orderings presented in section 3 confirm known empirical rules about the relationship between NMR chemical shifts and moieties on a scaffold for molecular families where SCS increments are available. Comparing the ordering of moieties by NMRscape with SCS increments reveals striking similarities for both the para-disubstituted benzene and methylene families, as shown in Tables S3 through S5 in the Supporting Information. This agreement between the moiety orderings discovered through NMRscape and known SCS increments demonstrates the ability of NMRscape to discover (known) rules while operating with no prior knowledge of the relevant chemical characteristics of the moieties involved. In addition, this prospect suggests that NMRscape may be used to identify new rules where none are known a priori. The moiety ordering obtained for the carbonyl compounds in this article illustrates this point. Prior work on predicting carbonyl chemical shifts considered single classes of compounds (e.g., separate SCS increments for aldehydes, ketones, acids, and esters3 or for substituted benzaldehydes and acetophenones13). The moiety ordering revealed by NMRscape covers all of these classes of compounds, as well as amides, carbonates, ureas, etc., with each class of compounds occupying a specific region of the resulting symmetric landscape in Figure 10. Carbonates and ureas occupy the bottom left corner of the landscape, while esters and amides stretch to the lower right and upper left corners. Ketones occupy the center to upper right corner, with alkyl moieties producing the highest chemical shifts clustered in the middle. These landscape regions are shown in Figure S2 of the Supporting Information. The physical basis of this proper ordering of chemically diverse moieties for carbonyl compounds can be understood in terms of electronic effects and will be considered in a future publication.38

correlations among the independent substituent sites needed to describe the 13C shift and also identifies optimal moiety orderings using the identified correlations.15 While explicit use of HDMR is beyond the scope of this work, a simple example of a three-site scaffold amenable to sorting is examined here. The sorting method for landscape construction can be generalized to an arbitrary number of scaffold sites, as each site is ordered independently upon averaging over the contribution of the other sites. In this fashion, the sorting procedure only implicitly includes cooperativity through the averaging procedure in eq 3. Using the sorting method, a landscape was generated for the methine scaffold (CH) in Figure 2f with three substituent sites X, Y, and Z. The molecules are of the form CHXYZ. Nominally, scaffold (c) is a restricted case of scaffold (f), but the available spectral databases did not provide sufficient experimental data for including the large number of moieties employed for scaffold (c) in section 3.2. Instead, N = 16 moieties had sufficient spectral data, yielding N(N + 1)(N + 2)/6 = 816 unique molecules due to the 3-fold degeneracy of the sites on the scaffold. Of these, 231 shifts are available in the KnowItAll database. The optimal moiety order and the resulting landscape are shown in Figure 13, with each shift plotted as a sphere with

Figure 13. (a) Optimal moiety ordering for the methine (CHXYZ) scaffold in Figure 2(f). (b) The associated three-dimensional landscape. The moiety sites labeled as X, Y, and Z are equivalent. Only the unique shifts are shown; the landscape has 6-fold symmetry. The size of each sphere is inversely proportional to its distance from the origin (1,1,1), corresponding to CHI3. The landscape appears to be monotonic.

the color specifying the shift value and the sphere size indicating distance from the origin. Only the unique shifts are shown for clarity. This landscape also appears to be monotonic.

4. EXTRACTING RULES FROM THE CHEMICAL SHIFT LANDSCAPES The results presented in this paper demonstrate the capability of NMRscape to predict 13C chemical shifts for diverse families of molecules with an accuracy exceeding that of the empirical HOSE, SCS, and NN methods. NMRscape can also reveal the systematic relationship between a 13C chemical shift and variable moieties using just the chemical identity of the moieties.

5. CONCLUSIONS This work presented NMRscape as a means to predict 13C NMR shifts for families of molecules without relying on a priori knowledge about the chemical properties of the contributing moieties. NMRscape rests on the fact that a unique integer labeling of the moieties in a library inherently forms a complete 9155

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A



set of variables to capture the relationship between the moieties and the NMR shift. If a sufficient number of physically reasonable variables are employed, the underlying foundations of OptiChem theory16,17 imply that a smooth, monotonic relationship should exist for the NMR shift as a function of the variables when an optimal ordering is found for the moieties corresponding to each variable. The resulting 13C chemical shift landscape deduced from a sparse sampling of the moieties can be used to predict the chemical shifts of compounds containing combinations of moieties for which no prior spectral data is available. NMRscape was found to predict 13C chemical shifts of para-disubstituted benzene derivatives, methylene compounds, and carbonyl compounds more accurately than the HOSE, NN, and SCS methods. NMRscape takes into account cooperativity effects between moieties without requiring prior knowledge of their chemical properties or their assumed interactions. As a result, chemical rules relating moieties to the NMR shifts may be revealed even if no existing rules are available. Close agreement between the optimal orderings for NMRscape and SCS increments for para-disubstituted benzene derivatives and methylene compounds was identified, showing that NMRscape is able to confirm known rules relating moieties to chemical shifts. Furthermore, the NMRscape moiety ordering for carbonyl compounds revealed new rules applicable across diverse classes of carbonyl derivatives. NMRscape relies on an available database of chemical shifts to construct the landscape. For molecular scaffolds with two functionalized sites, the results from employing reduced training sets show that only ∼10−20% of all possible compounds need to be present in order to construct a suitable landscape for prediction and analysis. HDMR treatment of similar molecular scaffold problems indicates that the fractional size of the training set actually decreases with additional functionalized sites.15 By combining these features, NMRscape could be employed to guide the generation of a minimally sized portion of a library for efficient estimation of the shifts over the remainder of the library. Finally, NMRscape should also be applicable to predict and analyze the chemical shifts from any other NMR-active nucleus, as well as other spectral properties, such as IR vibrational frequencies, as will be examined in a future work.38



REFERENCES

(1) Elyashberg, M.; Williams, A.; Blinov, K. Contemporary ComputerAssisted Approaches to Molecular Structure Elucidation; New Developments in NMR; Royal Society of Chemistry: Cambridge, UK, 2012. (2) Paul, E.; Grant, D. J. Am. Chem. Soc. 1963, 85, 1701. (3) Pretsch, E.; Bühlmann, P.; Badertscher, M. Structure Determination of Organic Compounds, 4th ed.; Springer-Verlag: Berlin, Germany, 2009. (4) Chen, L.; Robien, W. Anal. Chem. 1993, 65, 2282−2287. (5) Bremser, W. Anal. Chim. Acta 1978, 2, 355−365. (6) Meiler, J.; Meusinger, R.; Will, M. J. Chem. Inf. Comput. Sci. 2000, 40, 1169−1176. (7) Barone, V.; Cimino, P.; Crescenzi, O.; Pavone, M. J. Mol. Struct. 2007, 811, 323−335. (8) Elyashberg, M.; Blinov, K.; Smurnyy, Y.; Churanova, T.; Williams, A. Magn. Reson. Chem. 2010, 48, 219−229. (9) Meiler, J.; Maier, W.; Will, M.; Meusinger, R. J. Magn. Reson. 2002, 157, 242−252. (10) Smurnyy, Y. D.; Blinov, K. A.; Churanova, T. S.; Elyashberg, M. E.; Williams, A. J. J. Chem. Inf. Model. 2008, 48, 128−134. (11) Blinov, K. A.; Smurnyy, Y. D.; Churanova, T. S.; Elyashberg, M. E.; Williams, A. J. Chemom. Intell. Lab. Syst. 2009, 97, 91−97. (12) Baladina, A.; Kalinin, A.; Mamedov, V.; Figadere, B.; Latypov, S. Magn. Reson. Chem. 2005, 43, 816. (13) Patterson-Elenbaum, S.; Stanley, J. T.; Dillner, D. K.; Lin, S.; Traficante, D. Magn. Reson. Chem. 2006, 44, 797−806. (14) Hereafter in this article, the chemical shift will be considered as a function of a set of variables in order to distinguish from parameters, which nominally may be viewed as adjustable empirical quantities. (15) Izmailov, S.; Feng, X.-J.; Li, G.; Rabitz, H. J. Math. Chem. 2012, 50, 1765−1790. (16) Moore, K. W.; Pechen, A.; Feng, X.-J.; Dominy, J.; Beltrani, V.; Rabitz, H. Chem. Sci. 2011, 2, 417. (17) Moore, K. W.; Pechen, A.; Feng, X.-J.; Dominy, J.; Beltrani, V.; Rabitz, H. Phys. Chem. Chem. Phys. 2011, 13, 10048. (18) Pechen, A.; Rabitz, H. Europhys. Lett. 2010, 91, 60005. (19) KnowItAll AnyWare, version 3.7; Bio-Rad Laboratories, Inc., Informatics Division: Hercules, CA . (20) Kalinowski, H.-O. Carbon-13 NMR Spectroscopy; John Wiley and Sons: New York, 1988. (21) Shenvi, N.; Geremia, J. M.; Rabitz, H. J. Phys. Chem. A 2003, 107, 2066. (22) Liang, F.; Feng, X.; Lowry, M.; Rabitz, H. J. Phys. Chem. B 2005, 109, 5842. (23) McAllister, S. R.; Feng, X.; DiMaggio, P. A.; Floudas, C. A.; Rabinowitz, J. D.; Rabitz, H. A. Bioorg. Med. Chem. Lett. 2008, 18, 5967. (24) Briehn, C. A.; Schiedel, M. S.; Bonsen, E. M.; Schuhmann, W.; Baeuerle, P. Angew. Chem. 2001, 40, 4680. (25) Neuvonen, H.; Neuvonen, K.; Pasanen, P. J. Org. Chem. 2004, 69, 3794−3800. (26) Fürst, A.; Pretsch, E.; Robien, W. Anal. Chim. Acta 1990, 233, 213. (27) Goldberg, D. E. Genetic Algorithms in Search, Optimization, and Machine Learning; Kluwer Academic Publishers: Boston, MA, 1989. (28) Kohavi, R. International Joint Conference on Artificial Intelligence; Morgan Kaufmann: Burlington, MA, 1995; pp 1137−1143. (29) NMRPredict, version 4; Modgraph Consultants, Ltd.: Herts, U.K., 2008. (30) Dostovalova, V.; Fedorov, L.; Paasivirta, J. Magn. Reson. Chem. 1991, 29, 830−833. (31) Grant, D.; Paul, E. J. Am. Chem. Soc. 1964, 86, 2948. (32) Somayajulu, G.; Kennedy, J.; Vickrey, T.; Zwolinski, B. J. Magn. Reson. 1979, 33, 559−568. (33) Spiesecke, H.; Schneider, W. J. Chem. Phys. 1961, 35, 731. (34) Spiesecke, H.; Schneider, W. J. Chem. Phys. 1961, 35, 722. (35) Olah, G.; Mateescu, G. J. Am. Chem. Soc. 1970, 92, 1430. (36) Cornelis, A.; Lambert, S.; Laszlo, P. J. Org. Chem. 1977, 42, 381.

ASSOCIATED CONTENT

S Supporting Information *

Additional figures and tables, including 13C NMR chemical shifts measured in this work, truth plots, and comparisons of the moiety orderings found by NMRscape with moiety orderings obtained from SCS increments. This material is available free of charge via the Internet at http://pubs.acs.org.



Article

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS We acknowledge support from NSF and DARPA. K.W.M. acknowledges the support of an NSF graduate research fellowship. We also thank Gregory Banik at Bio-Rad Laboratories, Inc. for helpful discussion about working with their database. 9156

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157

The Journal of Physical Chemistry A

Article

(37) Bromilow, J.; Brownlee, R.; Lopez, V.; Taft, R. J. Org. Chem. 1979, 44, 4766. (38) Moore, K. W.; Li, R.; Pelczer, I.; Rabitz, H. Manuscript in preparation.

9157

dx.doi.org/10.1021/jp306353b | J. Phys. Chem. A 2012, 116, 9142−9157