Optimization and Automation of Quantitative NMR Data Extraction

Article pubs.acs.org/ac

Michael A. Bernstein,*,† Stan Sýkora,‡ Chen Peng,† Agustín Barba,† and Carlos Cobas†

†Mestrelab Research, S.L., Feliciano Barrera 9B − Baixo, 15706 Santiago de Compostela, Spain
‡Extra Byte, Castano Primo, Italy



ABSTRACT: NMR is routinely used to quantitate chemical species. The necessary experimental procedures to acquire quantitative data are well-known, but relatively little attention has been paid to data processing and analysis. We describe here a robust expert system that can be used to automatically choose the best signals in a sample for overall concentration determination and determine analyte concentration using all accepted methods. The algorithm is based on the complete deconvolution of the spectrum, which makes it tolerant of cases where signals are very close to one another, and includes robust methods for the automatic classification of NMR resonances and molecule-to-spectrum multiplet assignments. With the functionality in place and optimized, it is then a relatively simple matter to apply the same workflow to data in a fully automatic way. The procedure is desirable for both its inherent performance and applicability to NMR data acquired for very large sample sets.

Quantitative NMR (qNMR) is now widely applied1 in many scientific disciplines, from very pure compounds used in the pharmaceutical industry2 to mixtures of compounds in forensic samples.3 The method takes advantage of the fundamental property of NMR which produces a signal for any species that will have an area that is proportional to its concentration. This obviates the need to determine a compound-specific response factor, as is the case with UV detectors,4 for example. NMR, however, is an inherently lower sensitivity technique, and the equipment can be expensive. In an exhaustive validation, it was shown that the maximum combined measurement uncertainty is 1.5% for a confidence interval of 95%.5

With the wide acceptance of NMR now as the technique of choice in chemical compound quantitation, it follows that there are numerous descriptions for its applicability. These include pharmaceutical ingredients,2 natural products,6 and synthetic chemical compound libraries,7 where the sample is relatively simple because it typically has a single chemical species of interest. More complex mixtures may be analyzed by NMR to provide the concentration of known chemical components therein. These mixtures can also be quantified, and this is commonly employed in metabolomics and related studies.1a qNMR is used to accurately and precisely determine species’ concentrations as they change through the course of a chemical reaction: this is called reaction monitoring by NMR.8

qNMR is typically performed on compounds in solution. Particular consideration should be given to sample preparation and data collection conditions.6 The sample may be static in a tube or “on flow”.8b,9 qNMR may be used to determine compound concentration or purity. Concentration determination is typically accomplished either using an empirically derived “response factor” under particular experimental conditions10 or indirectly using an internal reference signal that represents a known concentration. This signal may be synthetically derived, as is the case with ERETIC1c and QUANTAS,11 or relative to an internal reference material.12 To facilitate this analysis, the internal chemical species used is often the residual signal from the deuterated solvent13 or the chemical shift reference compound.14 With purity determinations, the analyte is typically codissolved with a reference material of known purity, structure, and spectroscopic properties, and the analyte purity is still determined principally from the determined analyte concentration using a ratio method.12

The fundamental data extraction task relies on the accurate determination of absolute integrals for one or more signals that represent a known number of nuclides in the compound under analysis. For this, a spectrum must be well phased and not suffer from baseline distortions. This process typically uses manual determination of the signals to be integrated, setting of integral regions, and defining the number of nuclides (NN) for each integral region. Obtaining accurate signal integrals requires integration over a wide range, and this may be practically impossible where signals overlap or are very close or undesirable when the signal-to-noise ratio (SNR) is limiting. This can be a significant limitation to accurate integration, as noted by Yan and co-workers in their attempts to quantify material concentrations in compound libraries.7 Using our approach, qNMR with high accuracy is possible even when there is signal congestion and interfering, extraneous peaks. We describe in this Article computer algorithms and functionality that form a workflow to address these processing

© 2013 American Chemical Society

Received: February 6, 2013 Accepted: May 15, 2013 Published: May 15, 2013

dx.doi.org/10.1021/ac400411q | Anal. Chem. 2013, 85, 5778−5786

Analytical Chemistry

Article

Figure 1. (A) 1H NMR spectra of L-thyroxine in DMSO-d6 solution (600 MHz, 300K). The lower trace shows the spectrum after GSD peak picking and peak classification (see text) had been applied, and the upper spectrum shows the extracted solute spectrum that is effectively now used by the software for multiplet identification and integration. (B) 1H NMR spectra of moxifloxacin in DMSO-d6 solution (600 MHz, 300K). The GSD peaks are colored blue.

and analysis issues for a wide range of analysis conditions. We determine the average analyte concentration using one or more of the best multiplets in the spectrum. We describe the available methods to determine whether or not a multiplet is suitable for


concentration determination by not being “contaminated” with noncompound signals. With the fundamental design objectives achieved, automated application to large data sets or data as they are produced in real time is a relatively trivial extension of our method. A number of simple approaches have been described15 that process multiple qNMR determinations, and progress toward automated analysis has recently been reported.16 However, we believe our method is more general and can automatically select the best multiplets to report the average concentration.

function may also be used as it is very efficient at suppressing sinc-wiggle truncation artifacts.19 (2) The spectrum must be phased so all peaks have pure absorption line shape. The stringency for this requirement will depend partly on the necessity for accuracy and the SNR. In many cases, the performance of automated phasing procedures in the software is adequate. The need for a very carefully baseline corrected20 spectrum is relaxed when Global Spectrum Deconvolution21 (GSD, see peak picking below) is used for peak integration because the algorithm is insensitive to small phase errors and even large baseline distortions. Peak Picking. We support the use of either conventional peak picking, or GSD, the former not being recommended in this context because of the inaccuracy of the estimated peak parameters which, among other things, would compromise the performance of the signal classification and NN estimation. GSD was used by Bradley and co-workers to quantify nutrients and metabolites in cell cultures.22 GSD consists of the complete deconvolution of the frequency domain spectrum in order to obtain a reliable list of peaks and their parameters (chemical shift, peak height, line width at half height and line shape) even in situations characterized by a strong peak overlap (Figure 1). For effective GSD, ca. 5 data points or better are required to describe each peak. A key feature of GSD is that peak picking and optimization take place in the first and second derivative domains calculated using an optimized Savitzky−Golay algorithm23 where the number of points and polynomial order are calculated automatically. As a result, GSD provides both complete baseline independence and a marked resolution enhancement where even poorly resolved shoulders are converted into distinct peaks. GSD will always be performed as it enhances the automated workflow by providing full peak information on each and every peak in the spectrum. 
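The idea of picking peaks in the second-derivative domain can be illustrated with a minimal sketch. This is not the GSD algorithm itself: it assumes a fixed 5-point, second-order Savitzky−Golay filter (GSD chooses the window and order automatically), a noise-free synthetic trace, and a hypothetical relative threshold. All function names are illustrative only.

```python
def lorentzian(x, x0, height, fwhm):
    """Lorentzian line with the given centre, height and full width at half height."""
    g = fwhm / 2.0
    return height * g * g / ((x - x0) ** 2 + g * g)


def second_derivative_sg5(y, step):
    """Second derivative from a 5-point quadratic Savitzky-Golay filter.

    The fixed window and polynomial order are assumptions of this sketch.
    """
    coeffs = (2.0, -1.0, -2.0, -1.0, 2.0)  # standard 5-point, order-2 weights
    d2 = [0.0] * len(y)
    for i in range(2, len(y) - 2):
        acc = sum(c * y[i - 2 + k] for k, c in enumerate(coeffs))
        d2[i] = acc / (7.0 * step * step)
    return d2


def pick_peaks(x, y, rel_threshold=0.1):
    """Return x positions of sufficiently deep minima of the second derivative."""
    d2 = second_derivative_sg5(y, x[1] - x[0])
    cutoff = rel_threshold * min(d2)  # negative; hypothetical acceptance level
    return [x[i] for i in range(1, len(y) - 1)
            if d2[i] < cutoff and d2[i] < d2[i - 1] and d2[i] <= d2[i + 1]]


def count_maxima(y):
    """Number of local maxima in the trace itself (conventional peak picking)."""
    return sum(1 for i in range(1, len(y) - 1) if y[i - 1] < y[i] >= y[i + 1])


# Synthetic example: a strong line with a weaker line on its shoulder.
x = [-3.0 + 0.05 * i for i in range(141)]
y = [lorentzian(v, 0.0, 1.0, 1.0) + lorentzian(v, 0.8, 0.4, 1.0) for v in x]
peaks = pick_peaks(x, y)
```

In this synthetic case the summed trace has a single maximum, so a conventional maximum search would report one peak, while the second derivative yields two minima close to the true line positions. Because the derivative removes constant and linear terms, a flat or sloping baseline offset does not change the result.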
In the context of quantitation, the normalized area, or absolute integral (AI) in the arbitrary NMR scale, is of critical significance. GSD has the added, significant benefit of correctly integrating overlapping lines. In Figure 1A, we can see how GSD can be used when sharp solute lines overlap with a broad, residual solvent peak: here, the solute line information can be extracted by GSD. With solvent peaks alone (Figure 1B), we see the deconvoluted peaks (blue) in a more complex set of multiplets. In this case, the multiplets shown do not overlap and GSD-derived areas would therefore be usable. If, however, multiplets overlap, there is no automatic way to distinguish the resonance lines from each constituent multiplet, and this combination would be labeled as an undefined multiplet (“m”). In the second case, GSD is effective when peaks overlap or are very close but less so when multiplets overlap.

Peak Classification. The information from GSD is used for peak classification. This algorithm considers each peak in the spectrum and classifies it as attributable to the principal compound, unspecified impurity, specified impurity, solvent, 13C satellite, or a labile proton of the principal compound: this is an essential step in the overall workflow. This procedure relies on a fuzzy-logic scoring system concept that is used heavily throughout the whole peak classification and automatic assignment procedure. A detailed description of this system is beyond the scope of this work and will be discussed in a future article. Peak classification is especially useful for contaminated spectra, as it allows the quantitation program to derive integrals for the analyte peaks even when they are



EXPERIMENTAL SECTION

Validation samples were prepared using (±)-Ibuprofen (Sigma-Aldrich). Samples were weighed in a vial, and a volume of deuterated solvent (EURISO-TOP) was added. The sample was reweighed, and the mass of deuterated solvent was used to derive its volume and the Ibuprofen concentration. Solutions were prepared in triplicate using DMSO-d6, CDCl3, and CD3OD. Each NMR tube (Wilmad 527-PP-7) was capped and sealed with Parafilm to limit solvent evaporation. Spectra were recorded soon after sample preparation using a Bruker DRX 500 MHz instrument, with an interpulse delay of 30 s to allow for full relaxation. In all instances, the probe was tuned and matched manually, and a 90° pulse-width determination was performed. The spectra of acetanilide and sucrose were those described by Farrant and co-workers.11 The spectra were carefully phased and baseline corrected, and integrals for each multiplet were determined using Mnova software.17 For the manual process, concentrations were determined manually from the absolute integrals. Mnova software and the qNMR plugin, which constitutes the kernel of the expert system presented in this work, were then used to automatically determine the concentrations for the same spectra.
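The gravimetric step described above (solvent mass converted to volume, then to concentration) amounts to simple arithmetic, sketched below. The solvent densities are nominal room-temperature values and the sample masses are hypothetical; neither is taken from this work. The sketch also assumes, as the text implies, that the solution volume equals the solvent volume.

```python
IBUPROFEN_MW = 206.28  # g/mol for C13H18O2

# Nominal densities in g/mL near room temperature (approximate, assumed values).
SOLVENT_DENSITY = {"DMSO-d6": 1.19, "CDCl3": 1.50, "CD3OD": 0.888}


def concentration_mM(analyte_mg, solvent_g, solvent, mw=IBUPROFEN_MW):
    """Concentration in mM from the weighed analyte and solvent masses."""
    volume_mL = solvent_g / SOLVENT_DENSITY[solvent]  # mass -> volume
    mmol = analyte_mg / mw                            # mg / (g/mol) = mmol
    return mmol / volume_mL * 1000.0                  # mmol/mL = M, then to mM


# Hypothetical weighing: 20.628 mg of ibuprofen in 1.190 g of DMSO-d6.
example = concentration_mM(20.628, 1.19, "DMSO-d6")
```

For the hypothetical weighing shown, the result is very close to 100 mM, since 20.628 mg corresponds to 0.1 mmol and 1.19 g of solvent to 1.0 mL.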



RESULTS AND DISCUSSION

In this Article, we describe the quantitation process for 1H NMR solution spectra, but the same principle applies to any 1D NMR spectrum. We start with a time-domain FID and perform all the necessary processing to obtain an input spectrum for quantitation, and the quantitation algorithm then automatically defines each multiplet and NN in the spectrum relating to the compound and determines the concentration. This process can be carried out in the absence of the correct molecular structure, but if this is available, we take advantage of a sophisticated automatic assignment18 algorithm. This results in a significant improvement in efficiency by using the available spectroscopic and molecular information together to get more robust NN values and labile proton peak detection for each multiplet. With the spectrum preparation complete, the compound multiplets are now provided as input to the qNMR program. The details of these processes are discussed below because our workflow can differ slightly from established practices to make it robust and amenable to automation.

Basic Data Processing. It is well-known that the acquired data should best be processed to meet these requirements: (1) An apodization function should be chosen that does not affect the relative integrals. In practice, this means that the first point of the weighting function should not be close to zero. Exponential multiplication of ca. 0.3−0.5 Hz is therefore commonly used to modestly improve the SNR and remove possible time-domain signal truncation artifacts. The Hanning


Figure 2. (A) 1H NMR spectrum of sucrose (600.13 MHz, 300K) in D2O after peak picking, classification, and multiplet detection. (B) qNMR analysis summary.


may compromise, to some extent, the quantitative results of the estimated peak parameters. In order to improve these values, a refinement stage has been added which consists of optimizing the GSD peaks by means of a traditional line fitting procedure based on the Levenberg−Marquardt method25 which can be complemented, optionally, by a Simulated Annealing optimization algorithm to avoid potential local minima. It is worth mentioning that this refinement step is performed once the GSD peaks have been clustered into multiplets so that the peaks that need to be fit simultaneously are confined to a reasonable number. This fine-tuning procedure is practical because GSD has already identified the peaks and estimated the starting parameters which are very close to the optimal values so that only a few additional iterations will be needed. Line fitting will still provide inaccurate AIs (absolute integrals) when peak shapes are not described only by Lorentzian and/or Gaussian functions.

We use the 1H NMR spectrum (500 MHz, 300K) of Ibuprofen (100.46 mM in DMSO-d6) to compare the AI data that were determined for all 8 nonlabile solute proton multiplets using each integration method. The absolute figures are normalized to the value for sum integration (AI/NN = 100.0) to afford the following results: (1) sum integration: average = 100.0, SD = 1.1; (2) GSD integrals: average = 108.2, SD = 0.8; (3) line-fitting integrals: average = 103.9, SD = 2.7. This sample is somewhat unusual in that the compound multiplets are generally well separated, the SNR is high, and sum integration is therefore set to perform well. We see that sum integration and GSD integrals yield slightly different average AI/NN values, but both have small standard deviations in the measurements across all multiplets. In this case, line fitting gives an integral value closer to that from sum integrals, indicating a better computed value.
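Once each line is described by a height, a width at half height, and a line shape, its area follows analytically, and the AI of a multiplet is the sum over its lines. The sketch below assumes pure Lorentzian or pure Gaussian lines, which is a simplification of whatever line-shape parameter the software actually uses; the function names and the "L"/"G" labels are hypothetical. It also shows the normalization to a reference value of 100 used in the comparison above.

```python
import math


def lorentzian_area(height, fwhm):
    """Area under a Lorentzian of given height and full width at half height."""
    return math.pi * height * fwhm / 2.0


def gaussian_area(height, fwhm):
    """Area under a Gaussian of given height and full width at half height."""
    return height * fwhm * math.sqrt(math.pi / (4.0 * math.log(2.0)))


def multiplet_ai(lines):
    """Absolute integral of a multiplet from (height, fwhm, shape) tuples.

    shape is 'L' (Lorentzian) or 'G' (Gaussian); mixed shapes are not modelled.
    """
    total = 0.0
    for height, fwhm, shape in lines:
        if shape == "L":
            total += lorentzian_area(height, fwhm)
        else:
            total += gaussian_area(height, fwhm)
    return total


def normalize_to_100(values, reference):
    """Scale AI/NN values so that the reference value becomes 100.0."""
    return [100.0 * v / reference for v in values]


# Hypothetical 1:2:1 triplet of Lorentzian lines, 1 Hz wide.
triplet = [(1.0, 1.0, "L"), (2.0, 1.0, "L"), (1.0, 1.0, "L")]
ai = multiplet_ai(triplet)
```

Because the area is proportional to height times width, any systematic error in the fitted width propagates directly into the AI, which is one reason a conversion factor should be tied to the integration method used.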
The higher standard deviation results from the inability in some cases for a wide enough peak fit region to be selected because of close peaks, and this compromises the line-fitting result. It also follows that any factor used to convert AI/NN to concentration (see below) should be separately calculated for the integration method. Choosing an integration method may require some trial and error. If the highest accuracy is required and the spectrum is of high quality and not very crowded, then conventional integration generally performs best. However, spectroscopic considerations like the closeness of lines may rule this out and make GSD the method of choice. If GSD is shown to yield acceptable accuracy and precision for samples of known concentration, then GSD in conjunction with line fitting will be unnecessary. In every case, the software will produce a result and the onus is on the researcher to assess these for a small number of typical spectra to ensure the best method is chosen. Multiplet and Nuclide Determination. With all compound peaks now reliably characterized and classified and the integral method chosen, the final preparation task is to identify and classify the multiplets according to well-known rules for first-order analysis and determine the NN for each. This process uses proprietary algorithms26 that account for the separation between lines, symmetry, and their classification. Assigning first-order multiplet structure is again a complex task that relies on consideration of peak separations and heights. If the explicit multiplicity (e.g., “triplet”) cannot be determined, the multiplet is simply classified as a “multiplet”. The exact multiplicity can be used to help automatically select the best, “clean” multiplets that are least likely to have additional signals from impurities (see below).
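The nuclide-count estimate can be sketched as a loop over trial totals N: the multiplet integrals are normalized so that they sum to N, and each normalized value is tested for closeness to an integer. The exact probability function is not given in this Article, so the Gaussian score below is only a stand-in; the relative error, the acceptance level, and the rule of returning the smallest acceptable N are likewise assumptions of this sketch.

```python
import math


def estimate_nn(integrals, n_min=2, n_max=200, rel_err=0.05, accept=0.5):
    """Return (total N, per-multiplet counts) for the smallest compatible N.

    rel_err and accept are hypothetical tuning values, not published ones.
    """
    total = sum(integrals)
    for n_total in range(n_min, n_max + 1):
        # Step 1: normalize so that the integrals sum to the trial N.
        counts = [ai * n_total / total for ai in integrals]
        # Step 2: compare each value with its nearest integer (at least 1).
        nearest = [max(1, round(c)) for c in counts]
        scores = [math.exp(-0.5 * ((c - n) / (rel_err * n)) ** 2)
                  for c, n in zip(counts, nearest)]
        # Accept only if every multiplet is compatible and the counts add up.
        if min(scores) >= accept and sum(nearest) == n_total:
            return n_total, nearest
    return None


# Hypothetical absolute integrals for four multiplets (arbitrary units).
result = estimate_nn([3.02, 1.98, 1.01, 2.97])
```

For these hypothetical integrals the smallest compatible total is 9, with counts of 3, 2, 1, and 3. Taking the minimum score rather than the mean is a deliberately strict choice here: a single multiplet that cannot be an integer rules out the trial N.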

close to or even overlap with interfering peaks. A typical example is shown in Figure 1, where the program has correctly distinguished between compound, solvent, and impurity peaks. The user can control the number of fitting cycles the algorithm performs, providing a balance between computation time and line fitting accuracy. The process is fully automatic. Peak classification and GSD make a powerful combination because they allow peak areas from close or even overlapping multiplets to be accurately integrated and NN to be determined in a more robust way. If the residual solvent is being used as a reference integral, then the integral calibration for this can still be used when solute signals resonate very close (Figure 2). Every effort is made to identify compound labile protons. This is fundamental to qNMR: these signals are poor targets for quantitation because they often under-integrate as a consequence of partial chemical shift exchange with the water signal. If detected, they can automatically be ignored for quantitation.

Integration Method. Line fitting methods assume Lorentzian or close to Lorentzian line shapes and will underperform when peaks are unsymmetrical, perhaps as a result of poor shimming. In this case, reference deconvolution24 might be effectively applied to improve the line shape. Alternatively, conventional integration would be suitable, and this would be the more general solution. The choices for integration method are as follows.

Conventional Integration. NMR integrals are calculated by determining the running sum of all points in the integration segment. Using standard methods for peak integration may be the best choice under the most favorable conditions: the compound is very pure, SNR is high, the baseline is stable, and there is minimal signal overlap.
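A short sketch may help make conventional integration concrete. The first function is the running sum over the points of an integration segment. The second uses the analytic integral of an ideal Lorentzian to show how much of the line's area lies within a given number of line widths of its centre, which is why wide integration regions are needed. Both function names are illustrative, and real lines are rarely perfectly Lorentzian.

```python
import math


def running_sum(points):
    """Integral trace: cumulative sum of all points in the integration segment."""
    trace, acc = [], 0.0
    for p in points:
        acc += p
        trace.append(acc)
    return trace


def lorentzian_fraction(k):
    """Fraction of an ideal Lorentzian's area within +/- k line widths (FWHM)."""
    return 2.0 / math.pi * math.atan(2.0 * k)


# How wide must the integration region be for an ideal Lorentzian?
coverage = {k: lorentzian_fraction(k) for k in (5, 10, 32)}
```

For an ideal Lorentzian, roughly ±32 line widths are needed to capture 99% of the area, whereas ±5 line widths capture only about 94%. Narrowing the region to avoid neighboring signals therefore trades overlap errors for truncation errors.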
It has the disadvantage that relatively large spectral regions should be integrated to quantify the entire signal, and accuracy may be offset by new errors introduced by integrating noisy spectral regions possibly with imperfect baselines. This will restrict the number of peaks to be used in more difficult cases of signal overlap, if that is an issue.

GSD Integration. This has the advantage of providing the area of every peak it detects and is relatively insensitive to phasing imperfections and almost completely insensitive to baseline distortions. This makes GSD integration essential when there is signal overlap, or when peaks are partially overlapping and conventional integration becomes unreliable (see above). GSD can be performed very efficiently and quickly, taking only seconds to compute even on a spectrum having hundreds of lines. Unlike traditional line fitting algorithms, GSD does not need line positions to be known or specified before it is run. Consideration must be given that GSD parameters may not have been optimized and provide a less rigorous evaluation of line shape, especially when this deviates from Lorentzian or Gaussian shapes due to, for example, improper shimming. This will affect the calculated areas and therefore the calculated concentrations. Whether or not this is a limitation can be determined by inspecting the residual between the fit and experimental lines or comparing the qNMR results for a known standard when using each integration method.

Line Fitting. GSD has been designed to identify and fit all recognizable peaks even in very complex 1D spectra in a remarkably short computation time. This computational efficiency is achieved by constraining the number of fitting cycles within the range of 2 up to 10 iterations. However, this


robustness of the procedure in general and the potential for its automation. qNMR is most often performed using one of the following approaches and the software accommodates all. These choices are described for completeness, and all can be accessed and optimized by using the software through the UI (user interface). This, however, need only be done once, and then automated operation is possible (and optimized) for a compound or spectral class. Synthetic or Endogenous Reference Peaks. The objective is to use a peak (or peaks) as a surrogate or indirect concentration reference. These are typically present in each spectrum together with signals from the analyte, but this is not an absolute requirement. The signal may be from almost any species present, with common examples being the residual solvent signal (e.g., DMSO-d6H) or the chemical shift reference signal (e.g., TMS). To implement this procedure, the AIr (absolute integral of the reference) is determined for the known region of interest (ROI) containing the reference signal(s) that is known to correspond to a given concentration. The AI/NN is computed for each solute multiplet, and the analyte concentration is directly determined using a simple ratio. The utmost care must be taken to ensure that only this reference signal(s) integral is measured. This is not a complication with synthetic signals because they can be “placed” in a spectroscopically silent region or a separate spectrum altogether. However, when the residual solvent signal is used for this purpose, it may be the case that analyte material signals coresonate or overlap, and a simple integration of this region would therefore produce an incorrect AIr and incorrect concentration values for every solute multiplet. In this case, the user should use areas determined using GSD and account also for the peak classification procedure to obtain an accurate reference signal area measurement for the residual solvent alone. Concentration Conversion Factor. 
This method relies on a spectrum having been acquired for a sample of known concentration, typically prepared gravimetrically. One sample may be used or several through serial dilution. An NMR spectrum is acquired for each sample using a particular instrument and under specified quantitative conditions for spectrum acquisition. Particular attention must be paid to probe tuning. Next, the AI is determined for a spectral region having a known NN. The concentration conversion factor (CCF) then is simply the number which, when multiplied by the AI/NN, affords the known concentration. While experimental conditions should not change between this calibration exercise and subsequent analyses, numerical compensation can be easily applied if the number of scans (NS) changes and, to a lesser level of success, the receiver gain (RG) or 90° pulse length (PW). The software stores the NS, PW, and RG values used for the sample to determine the reference CCF standards (NSr, PWr, and RGr). These values for the analyte are automatically checked by reading the values from the spectrometer acquisition file, and an adjustment to the AI is made automatically. This follows a straight ratio: NSr/NSa and PWr/PWa, where the “a” denotes the value used for the analyte sample. With Bruker spectrometers, RGr/RGa effects the same compensation. In practical terms, this procedure is easily applied to test samples. Simply multiplying the AI/NN for a multiplet by the CCF directly affords the concentration.
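Both concentration approaches reduce to simple ratios, sketched below. The function and parameter names are hypothetical, and the numbers in the example are invented for illustration. The compensation follows the straight ratios stated in the text (NSr/NSa, PWr/PWa, and RGr/RGa), with the receiver-gain term defaulting to 1 when it is not used.

```python
from statistics import mean


def conc_from_reference(ai, nn, ai_ref, conc_ref):
    """Ratio method: ai_ref is the reference integral that represents conc_ref."""
    return conc_ref * (ai / nn) / ai_ref


def ccf_from_standard(ai, nn, known_conc):
    """CCF: the number that multiplies AI/NN to give the known concentration."""
    return known_conc / (ai / nn)


def compensate_ai(ai, ns_r, ns_a, pw_r, pw_a, rg_r=1.0, rg_a=1.0):
    """Scale an analyte AI to the acquisition conditions of the CCF standard."""
    return ai * (ns_r / ns_a) * (pw_r / pw_a) * (rg_r / rg_a)


def conc_from_ccf(ai, nn, ccf):
    """Concentration of one multiplet from its AI, NN and the CCF."""
    return ccf * ai / nn


# Hypothetical standard: AI = 3000 for 3 nuclides at 100 mM, 16 scans.
ccf = ccf_from_standard(3000.0, 3, 100.0)

# Hypothetical analyte multiplets as (AI, NN), acquired with 32 scans.
multiplets = [(4000.0, 2), (6000.0, 3)]
per_multiplet = [
    conc_from_ccf(compensate_ai(ai, ns_r=16, ns_a=32, pw_r=10.0, pw_a=10.0), nn, ccf)
    for ai, nn in multiplets
]
average_conc = mean(per_multiplet)
```

In this invented example, doubling the number of scans doubles the raw AI, and the NSr/NSa ratio brings it back to the scale of the standard before the CCF is applied. Averaging over the selected multiplets gives the reported concentration.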

Once the peaks have been classified and the multiplet intervals determined, the number of nuclides (e.g., number of protons) corresponding to each multiplet is estimated. This is done as follows. One assumes that the total number of protons could be some value N, where N is iterated between Nmin and Nmax (typically 2 and 200, respectively). For each N, a “compatibility” score and significance are computed as follows: (1) N is first used to normalize the intensities (the total must match N). (2) For each multiplet, a check is made whether its intensity corresponds to an integer nuclei count, considering the assumed relative integration errors. For brevity, we omit the exact form of the employed probability function. (3) Multiplets which correspond to