Effective Protocol for Database Similarity ... - ACS Publications

Oct 26, 2009 - Gregory K. Pierens,*,† Mehdi Mobli,‡ and Viktor Vegh†. Centre for Magnetic Resonance and the Institute for Molecular Bioscience, ...
0 downloads 0 Views 1MB Size
Anal. Chem. 2009, 81, 9329–9335

Effective Protocol for Database Similarity Searching of Heteronuclear Single Quantum Coherence Spectra Gregory K. Pierens,*,† Mehdi Mobli,‡ and Viktor Vegh† Centre for Magnetic Resonance and the Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland 4072, Australia The nuclear magnetic resonance (NMR) chemical shift exquisitely describes the chemical environment of the atoms in a molecule. Here we describe methods that utilize this information as an experimental probe to match 2D NMR heteronuclear single quantum coherence (HSQC) spectra of pure, unknown compounds to a database of known compounds. We implemented and compared two different approaches for similarity searching of HSQC spectra. According to our findings, our new discrete selfadaptive differential evolution method performs better than the previously published shifted grid, multiple resolution approach. The new method is provided in detail and comparisons have been performed for a set of HSQC spectra. The similarity comparison involves a peak-topeak matching of different spectra, followed by a selection criterion and ranking to establish a level of match. The methodology of many rational drug design research projects is based on the principle that structurally similar compounds are more likely to have comparable physicochemical properties.1-3 The advent of so-called molecular “fingerprints” to find similarity within databases has been rapidly adopted by the pharmaceutical industry, due to its speed in searching very large databases. These fingerprints are bit strings that encode structural features and/or calculated properties, i.e., whether a functional group or fragment is present in the molecule or the calculated property is within a certain range of values.4,5 In this type of similarity searching, molecular fingerprints are generated for the chosen reference compound and for all compounds within a database. The fingerprint of the reference compound is then compared associatively, in a bit wise manner, to each database fingerprint for overlapping features. The measure of similarity is then calculated using a figure-of-merit, which in this case is a * Corresponding author. Gregory K. Pierens, Centre for Magnetic Resonance, Level 2, Gehrmann Laboratories, Research Road, The University of Queensland, Brisbane, Queensland, Australia 4072. E-mail: [email protected]. † Centre for Magnetic Resonance. ‡ Institute for Molecular Bioscience. (1) Johnson, M. A.; Maggiora, G. M. Concepts and Applications of Molecular Similarity; John Wiley & Sons: New York, 1990. (2) Martin, Y. C.; Kofron, J. L.; Traphagen, L. M. J. Med. Chem. 2002, 45, 4350–8. (3) Patterson, D. E.; Cramer, R. D.; Ferguson, A. M.; Clark, R. D.; Weinberger, L. E. J. Med. Chem. 1996, 39, 3049–3059. (4) Eckert, H.; Bajorath, J. Drug Discov. Today 2007, 12, 225–33. (5) Willett, P. Drug Discovery Today 2006, 11, 1046–53. 10.1021/ac901616t CCC: $40.75  2009 American Chemical Society Published on Web 10/26/2009

number whose size determines the quality of fit. A number of different coefficients have been proposed in the literature, out of which the Tanimoto6 and Tversky7 coefficients are frequently used. Screening of large natural products libraries has been and still is an important and active area of research for the drug discovery industry, since they exhibit high specificity and chemical diversity. This method of drug discovery is time-consuming and frequently the isolated active compound has been reported previously or is an undesirable compound. Many dereplication protocols have been developed to identify these compounds. These usually combine the use of a separation procedure with spectroscopic methods and database searching. The most common methods used for compound identification are liquid chromatography/mass spectrometry (LC/MS)8 and liquid chromatography/nuclear magnetic resonance (LC/NMR).9 The LC/MS method has become the dereplication tool of choice, due to the small quantities of sample required and the fast turnaround of screening. The downfall of this method is that it can produce large answer sets when querying databases on the molecular weight alone. The LC/NMR methodology suffers from lack of sensitivity compared to LC/MS but provides a wealth of structural information. It can discriminate small structural features between compounds of the same molecular weight and even the same molecular formulas, e.g., structural isomers. The interpretation of the traditional 1D (1H) NMR spectrum becomes more difficult with increased molecular complexity and molecular composition of the sample (i.e., mixtures). Multidimensional NMR techniques have been developed to overcome this spectral overlap problem. These include homonuclear experiments (e.g., correlated spectroscopy (COSY) and total correlated spectroscopy (TOCSY)) and heteronuclear experiments (e.g., heteronuclear single quantum coherence (HSQC) and heteronuclear multiple bond correlation (HMBC)).10 The heteronuclear 13C HSQC experiment is extensively used in structure elucidation or confirmation of molecular structure for isolated or synthesized compounds. The result of perform(6) Willett, P.; Barnard, J. M.; Downs, G. M. J. Chem. Inf. Model. 1998, 38, 983–996. (7) Tversky, A. Psychol. Rev. 1977, 84, 327–352. (8) Shigematsu, N. J. Mass Spectrom. Soc. Jpn. 1997, 45, 295–300. (9) Bobzin, S. C.; Yang, S.; Kasten, T. P. J. Ind. Microbiol. Biotech. 2000, 25, 342–345. (10) Claridge, T. D. W. High-Resolution NMR Techniques in Organic Chemistry; Elsevier Sciences: Amsterdam, The Netherlands, 1999; Vol. 19.

Analytical Chemistry, Vol. 81, No. 22, November 15, 2009

9329

ing such an experiment is a 2D plot with 1H and 13C chemical shifts defining the axes. The high-intensity features of this plot, referred to as “peaks”, allow identification of directly bonded hydrogen and carbon atoms within a compound. The peak pattern or distribution of peaks can be thought of as being very characteristic of a particular substance. The 2D coordinates of the peaks are generally reported without any reference to intensity or peak width. The coordinates themselves are referred to as the proton and carbon chemical shifts. The chemical shifts give an indication of the chemical environment of the atoms, and hence, information about the chemical structure of the compound. However, in most cases further structure elucidation experiments must be employed to obtain the molecular composition of an unknown compound. As already mentioned, molecular fingerprints have been used extensively to compare compounds, particularly in the drug discovery and design industries, here we wish to apply this approach to experimental 2D NMR data. The aim is to measure similarities between a reference compound and a database of compounds, allowing possible matches to be identified. However, methods capable of achieving this are scarce. Recently, work was published that details similarity searching using 2D NMR spectra11 to identify duplicate pure compounds based on HSQC NMR data. This published grid-based HSQC fingerprints method will be compared to a new approach of HSQC similarity searching, which is based on an optimization technique termed discrete self-adaptive differential evolution.12 A method of selecting similar HSQC spectra and rankings is also outlined. In this work, the proposed method is discussed for pure compound HSQC spectra. A brief demonstration of this method to more complicated spectra, such as noisy spectra and spectra of mixtures of compounds, is documented in the Supporting Information. In the following section, the grid-based method is briefly outlined along with a detailed description of the proposed approach. Afterward, comparative results based on a set of 51 HSQC spectra are provided to establish the ability of the two methods to perform database matching of HSQC spectra. METHODS Grid-Based Methods of Generating HSQC Fingerprints. The method outlined by Hinneburg11 was implemented in the MATLAB computing environment.13 Three methods for generating a fingerprint of a HSQC spectrum were proposed: (1) shifted grids, (2) different resolution grids, and (3) combination of the shifted and different resolutions. Only the combination of shifted grids with different resolutions will be presented, since this method was found to be superior over the other two implementations according to Hinneburg et al. All three implementations were proposed to overcome the shortfall of a simple rectangular fixed grid. Since the fingerprints are represented as binary strings and are compared in a bit wise manner, adjacent peaks falling into different cells of the binary string that are deemed to be visually similar will never be directly compared. As a consequence, accurate matching of HSQC spectra requires that the spectra are (11) Hinneburg, A.; Egert, B.; Porzel, A. J. Integr. Bioinf. 2007, 4, 53. (12) Vegh, V.; Pierens, G. K.; Tieng, Q. M. Int. J. Inn. Comp., Inf. and Control, Submitted for publication. (13) The MathWorks. http://www.mathworks.com.

9330

Analytical Chemistry, Vol. 81, No. 22, November 15, 2009

all correctly referenced.14 Therefore, in our implementation of the grid-based method the fingerprint was generated with four different grid resolutions and five different grid shifts, which were in accordance with good practice as reported in Hinneburg et al. The four different resolutions were chosen to minimize the overlap of the cell borders. The grid dimensions that were used for carbon were 3, 4, 7, and 10 ppm and for the proton 0.3, 0.4, 0.7, and 1 ppm, respectively. The advantage of the grid-based method is the fast generation of fingerprints and direct calculation of the Tanimoto coefficient. The grid-based method nonetheless suffers from specific major disadvantages: (1) The grid-based cells have boundaries, and peaks with similar chemical shifts located in different cells will not be recognized as being similar. (2) Multiple peaks in close proximity may be in the same grid cell and therefore only contribute one bit to the fingerprint, as opposed to the number of peaks. (3) The Tanimoto coefficient does not consider the bit density of the fingerprint and has some bias toward fingerprints with similar bit densities. Peak Matching Method. Given the limitations present in the grid-based method, we propose a better approach that involves performing a one-to-one match of peaks between two different HSQC spectra. The question of matching different number of HSQC peaks first arises. Since our goal was to perform a one-toone match of the peaks between different spectra, the number of matched peaks must be determined from the HSQC spectrum with the least number of peaks, and the leftover peaks from the larger spectrum will be ignored in the preliminary comparison. For example, this means that if a HSQC spectrum containing six peaks is compared to another spectrum containing eight peaks, then the six peaks of the first spectrum are mapped or matched to the “best matching” six peaks in the second spectrum, such that the overall sum of peak-to-peak distances is minimized. Two different spectra (denoted as sets of complex points {Pi: i ) 1, . . . ,NP} and {Qj: j ) 1, . . . ,NQ} arranged such that NP e NQ) are matched using a mutually distinct permutation d ∈ ∼ NNP e NQ through an error function appropriately measuring the sum of distances between the matched peaks: NP

error(d ˜) ) log(1 +

∑ |P

i

- Q(di)|),

i)1

error ) 0,

iff P ≡ Q

To find d given the error function above, we propose the use of ∼ the discrete self-adaptive differential evolution (dSADE) algorithm (see algorithm 4 in Vegh et al.12). The developed discrete optimization technique is general in its nature, and it was applied to this particular problem of matching HSQC spectra. To find d, the algorithm starts off with a ran∼ domly distributed search space, which is then refined through successive generations of the parameters. The convergence to the required solution was achieved by minimizing the error function. For completeness, pseudo code of the algorithm is presented below. In the algorithm, p is the initial search space, rand is a function generating uniformly distributed random numbers in (0,1), Gmax (14) Zheng, M.; Lu, P.; Liu, Y.; Pease, J.; Usuka, J.; Liao, G.; Peltz, G. Bioinformatics 2007, 23, 2926–2933.

is the maximum number of generations, and M is the size of the search space (i.e., population size). All results provided were obtained using Gmax ) 50 000, M ) 2NQ, and f was the error function from above. Function mod1 provides modulo 1 arithmetic over the parameter space and ensures that the parameter control variable (i.e., p) is bounded. The order function returns indices of the entries corresponding to values in any vector from smallest to largest. Moreover, Fj and CRj are adaptive parameters of the general differential evolution algorithm,15 cj is a vector of integers in [1,NP] to be augmented through the iterations, and f is the error function to be minimized. Mutually distinct integers r1, r2, and r3 in [1,M] are used for parameter mixing. RESULTS AND DISCUSSION A small database of 51 spectra was used for this study (see Figure S-4 in the Supporting Information for compound structures). The spectra of the compounds were assembled according to their particular properties: (i) a small combinatorial library16,17 (13 compounds, 1-13) was based on the 3-chloro-4-hydroxyphenylacetamide fragment, (ii) a group of natural products (10 compounds; actinophyllic acid (14),18 endriandrin A (15),19 latifolian A and B (16, 17),20 mearsamine (18),21 perspicamide A and B (19, 20),22 petrosamine B (21),23 spermatinamine (22),24 and a derivative of vanillic acid (23)25), which included (15) Brest, J.; Greiner, S.; Boskovic, B.; Mernik, M.; Zˇumer, V. IEEE Trans. Evol. Comput. 2006, 10, 646–657. (16) Davis, R. A.; Watters, D.; Healy, P. C. Tetrahedron Lett. 2005, 46, 919– 921. (17) Davis, R. A.; Pierens, G. K.; Parsons, P. G. Magn. Reson. Chem. 2007, 45, 442–445. (18) Carroll, A. R.; Hyde, E.; Smith, J.; Quinn, R. J.; Guymer, G.; Forster, P. I. J. Org. Chem. 2005, 70, 1096–1099. (19) Davis, R. A.; Carroll, A. R.; Duffy, S.; Avery, V. M.; Guymer, G. P.; Forster, P. I.; Quinn, R. J. J. Nat. Prod. 2007, 70, 1118–1121. (20) Rochfort, S. J.; Towerzey, L.; Carroll, A.; King, G.; Michael, A.; Pierens, G.; Rali, T.; Redburn, J.; Whitmore, J.; Quinn, R. J. J. Nat. Prod. 2005, 68, 1080–1082. (21) Katavic, P. L.; Venables, D. A.; Guymer, G. P.; Forster, P. I.; Carroll, A. R. J. Nat. Prod. 2007, 70, 1946–1950. (22) McKay, M. J.; Carroll, A. R.; Quinn, R. J. J. Nat. Prod. 2005, 68, 1776– 1778. (23) Carroll, A. R.; Ngo, A.; Quinn, R. J.; Redburn, J.; Hooper, J. N. A. J. Nat. Prod. 2005, 68, 804–806. (24) Buchanan, M. S.; Carroll, A. R.; Fechner, G. A.; Boyle, A.; Simpson, M. M.; Addepalli, R.; Avery, V. M.; Hooper, J. N. A.; Su, N.; Chen, H.; Quinn, R. Bioorg. Med. Chem. Lett. 2007, 17, 6860–6863. (25) Feng, Y.; Carroll, A. R.; Addepalli, R.; Fechner, G. A.; Avery, V. M.; Quinn, R. J. J. Nat. Prod. 2007, 70, 1790–1792.

more complex compounds that should not be similar to others, and (iii) a set of 28 randomly selected compounds (24-51).26 All experimental NMR data were taken directly from the literature in which all spectra were acquired in DMSO-d6. The input data was a list of Cartesian points in which the 13C chemical shift is represented as the x-coordinate and 1H as the y-coordinate. Any duplicate (x,y) pairs of points were removed, so that the input data resembled the data directly obtained from a processed 2D NMR spectrum. Information about whether the peaks originated from methyne, methylene, or methyl carbons was not used in the analysis. Grid-Based Methods of Generating HSQC Fingerprints. The fingerprints produced using the combined shifted grids with different resolutions were very large, consisting of 61 152 bits. The bits were set to 1 if a peak was found in a corresponding grid cell, otherwise, the bits were set to 0. The fingerprints were generated quickly, since the bit positions of the fingerprint were populated directly from the chemical shifts through an appropriate mapping. The quality of the comparison of the generated fingerprints was calculated using the Tanimoto coefficient,6 which results in a number between 1 (i.e., the same, bit-strings being identical) and 0 (i.e., dissimilar, no overlap in bit-string). The HSQC spectra from the database was compared to itself and as a result each compound must have one perfect match (Tanimoto coefficient ) 1). This validation was used as a control throughout this study. A major consideration in any similarity comparison method is the definition of “similarity”. A cutoff of 0.7 is generally accepted as a standard when the Tanimoto coefficient is used for molecular fingerprinting. If a Tanimoto coefficient of 0.7 was used as the cutoff value, this would result in the selection of compounds having a minimum of 70% of the bits set to “1” in the same position within the bit-string. This level of cutoff cannot be used with similarity searching applied to HSQC spectra using the above fingerprint method. This is due to the fact that the bit density of our fingerprint is very low, resulting in a much smaller Tanimoto cutoff than for molecular fingerprints. The grid-based similarity was performed on the database and the number of predicted similar compounds using the varying Tanimoto coefficient cutoff (0.35, 0.5 and 0.7) (see Supporting Information). There appears to be very little variation in the number of predicted compounds when the Tanimoto cutoff is changed between 0.35 and 0.70. The only exceptions are between compounds from the combinatorial library (compounds 1-13). As the cutoff is restricted to select more similar HSQC spectra, the amount of predicted similarity for compounds 1-13 is dramatically reduced. Peak Matching Method. The dSADE protocol was implemented in the MATLAB computing environment,13 and an example of the resulting peak-to-peak matching between two spectra is illustrated in Figure 1. The matched peaks are shown by linked solid lines. In this example, the first HSQC spectrum contains 11 peaks and the second has 5 peaks. This results in the match of the 5 peaks from the first spectrum to the “best” 5 peaks in the second spectrum. The remaining 6 unmatched peaks in the second spectrum will not contribute to the initial selection (26) Griffiths, L.; Beeley, H. H.; Horton, R. Magn. Reson. Chem. 2008, 46, 818– 827.

Analytical Chemistry, Vol. 81, No. 22, November 15, 2009

9331

Figure 1. An example of using dSADE matching on two HSQC spectra: (a) HSQC spectra (A, O, and B, b) and (b) HSQC matching showing the matched (O and b joined by a solid line) and unmatched peaks (O).

Figure 2. The result of matching of two HSQC spectra of the same compound in which one spectrum was incorrectly referenced. (Note: peaks are not necessarily matched by closeness but the overall peakto-peak distance.)

of similarity. A major advantage of using our novel peak-to-peak matching is that it is possible to detect spectra which are incorrectly referenced. In such a case, a match in which all peaks have the same distance offset and direction is obtained. The incorrectly referenced matches are included in the resulting list of spectra. An example of incorrectly referenced but correctly matched spectra is shown in Figure 2. A disadvantage of using the discrete self-adaptive differential evolution method is the amount of time required to match spectra of larger size. As the number of peaks in the HSQC spectra increases, the time taken to perform a match also increases, due to the factorial relationship between the spectra length and the possible solutions. Additional information about the computational time with respect to comparison size is presented in the Supporting Information. In summary, when the size comparison doubles, i.e., from a “n” peak to “n” peak comparison to a “2n” peak to “2n” peak comparison, the time increases by approximately a factor of 10. This increase in time could limit the general applicability of our technique to large molecules or mixtures of compounds. Further work is required to fully investigate the limits of this technique. Nevertheless, obtaining the correct peak-to-peak match is the primary objective of our research. The chemical shifts were normalized by dividing the carbon chemical shift by 160 and the proton chemical shift by 12, before calculating the matched peak-to-peak distances. This was necessary, due to the large difference in sweep width measured in parts per million of the carbon and proton axes. The optimal peak-topeak match returns the sum of distances between the corresponding matched peaks. The sum of all peak-to-peak distances results in a total measure for the HSQC comparison. This total distance could be used to compare the spectra, but it was found that individual large peak-to-peak distances dominate the measure and closely located peaks are obscured. A better metric was found to be the average matched distance per peak (AMD). This was 9332

Analytical Chemistry, Vol. 81, No. 22, November 15, 2009

calculated by dividing the total distance by the number of peaks that contribute to the total distance. A threshold value was determined empirically for the matched HSQC spectra. It was found that an AMD of less than or equal to 0.05 normalized distance units provided an appropriate group of “similar” HSQC spectra. This value corresponds to an 8 ppm distance error, if the error was only associated with carbon chemical shifts. Similarly, a 0.6 ppm distance error was implied, if the error was only associated with proton chemical shifts. These are conservative bounds which can be changed by varying the user defined selection criteria. The selected set of “similar” HSQC spectra can be used at this stage to search for spectra that have common substructures or to find duplicate spectra. To this point, the data have been selected on a unique peak-to-peak matching basis and the unmatched peaks have not been considered in the analysis. Several protocols were considered to include the unmatched peaks in the classification and as part of the similarity figure-of-merits. A correlation coefficient (Pearson product moment correlation coefficient27) was calculated between the two spectra, which is only possible when the spectra have the same number of peaks. Therefore, the spectrum with the smaller number of peaks was padded to become the same length as the larger spectrum and two different strategies were examined. The first was to match the unmatched peaks to an average chemical shift of the other HSQC spectrum. The second method was to match the unmatched peaks in the larger spectrum to the closest peak in the other spectrum. Both of these methods biased the match and were not considered further in our investigations. It should be noted that calculating the correlation coefficient between two differently referenced spectra of the same compound was a quick method of detecting incorrectly referenced duplicate spectra. No unbiased method for the inclusion of the unmatched peaks involving the correlation coefficient has been found and remains under investigation. The protocol decided upon was to scale the AMD by taking into consideration the number of unmatched peaks. This was achieved by using an exponentially weighted correction factor with the attenuation parameter set to 4. A scaled average matched distance (sAMD) was calculated by dividing the already calculated AMD by the scaling factor (see equation). If the HSQC spectra have the same number of peaks, i.e., all peaks are used in the match, then the scaling factor was 1 and the values for AMD and sAMD would be the same. If the peak difference between the two HSQC spectra were 1, 2, 3, or 4, then AMD values will increase by a factor of 1.06, 1.25, 1.56, and 2.0, respectively, resulting in the sAMD (note (27) Stigler, S. M. Stat. Sci. 1989, 4, 73–79.

Figure 3. Similarity matrix for (a) the peak matching method with AMD e 0.05 (bottom) and further refined by AMD e 0.05 and sAMD e 0.075 (top). (b) The comparison of the grid-based method (bottom, most similar 106 compounds) and the peak matching method e0.05 and sAMD e 0.075 (top, 106 compounds). The gray squares are HSQC spectra found to be similar by both methods.

that the attenuation factor effectively describes the point at which the probability is reduced by half). The sAMD was used to rank the level of similarity of already selected HSQC spectra. A cutoff value for the sAMD of less than or equal to 0.075 was used in the analysis, but this cutoff value can be optimized for the specific research question (the value is in practice arbitrary as a ranked list of hits may be produced, displaying for example the top 20 hits regardless of cutoff value used for sAMD) scaling factor )

1 (1 + (∆/af)2)

where ∆ was the difference in the number of peaks between the two HSQC spectra and af is the attenuation factor; in our case, we set it to 4. The AMD/sAMD similarity matrix is shown in Figure 3a, where a black square is an indication of similarity and the values on the axis represent distinct spectra. The left-hand side lower triangle of the matrix represents similar HSQC spectra using AMD e 0.05, resulting in 184 matches. The right-hand side, upper triangle of the matrix represents similar HSQC spectra restricted by AMD e 0.05 and sAMD e 0.075. The use of the sAMD coefficient is primarily to correct for differences in the number of peaks, but if the initial match is very good, spectra with large differences in the number of peaks can still satisfy this second selection criterion. To illustrate this point, compound 1 has only 4 peaks, but it was still found to be similar to all compounds within the combinatorial library set (1-13). Compound 6 has the largest number of peaks in this combinatorial set (15 peaks), and even when the sAMD coefficient is increased by 700% compared to the AMD value, it still satisfies the sAMD cutoff. With the use of the sAMD cutoff on the preliminary selected compounds, approximately two-thirds of the initial set of HSQC spectra survive the sAMD refinement. In summary, the protocol used in the similarity searching of a database of HSQC spectra of pure compounds is to initially match the peaks from two HSQC spectra using the dSADE algorithm. Second, given an established set of matched peaks, an AMD was

calculated. The initial set of similar HSQC spectra had an AMD of e0.05 normalized distance units. This set contains spectra that have all peaks closely matched but with no consideration toward the unmatched peaks in the larger spectrum. The third stage of the selection process is to apply a correction to the AMD to compensate for the difference in the number of peaks between the two spectra. This is achieved by dividing the AMD by the scaling factor with an attenuation parameter set to 4 as was used in our protocol. The resulting sAMD coefficient cutoff was set to a value of e0.075 which reduced the set of AMD matched spectra to obtain a final set, which is ordered by increasing sAMD size with the smallest value being classed as the most similar spectrum. Comparison of the Two Methods. A direct comparison between the two methods was deemed necessary to establish the appropriateness of our proposed method. It was decided to compare the peak matching method to the grid-based method using the same number of predicted similar spectra. Our proposed protocol predicted 106 similar spectra, which required the lowering of the Tanimoto threshold used in the grid-based method to 0.23. This resulted in the prediction of the same number of similar spectra. The two methods gave similar results, with 67/106 of the predicted spectra being predicted by both methods (Figure 3b). The major difference was that the grid-based method predicts more spectra within the combinatorial library spectra (1-13), compared to our peak matching protocol. This again highlights a major problem with the grid-based method, since it clusters peaks in the aromatic region of compounds 1-13. This results in a larger chance of predicting similar spectra, compared to the proposed method, which uses a one-to-one peak comparison. Only one set of HSQC spectra predicted to be similar will be examined in detail in this article for the sake of brevity. In this case, the reference spectrum was chosen to be spectrum 40. Both methods were used to predict which spectra were similar to our reference spectrum from our small HSQC database. The grid based method found spectrum 40 was similar to three other spectra 40, 37, and 25 at a Tanimoto threshold of 0.23. The peakto-peak matching followed by the AMD selection predicted seven spectra to be similar. These were spectra 40, matched to 37, 30, Analytical Chemistry, Vol. 81, No. 22, November 15, 2009

9333

Figure 4. Similar peak-to-peak matching of spectrum 40 to (a) spectrum 37, (b) spectrum 30, (c) spectrum 25, (d) spectrum 33, (e) spectrum 31, and (f) spectrum 14. HSQC spectrum with a larger numbers of peaks use the O symbol, and the HSQC spectrum with a smaller number of peaks use the * symbol.

compounds are shown in Figure 5. From the results, the similarity of spectrum 40 to spectra 30, 33, and 31 can be justified also from their structures. However, the same cannot be said for spectrum 14 as both the spectrum and the structure of the compound is in poor agreement to the reference compound. It is thus encouraging that this is removed from the final set of HSQC spectra sAMD criteria. CONCLUSIONS

Figure 5. Structures for the compounds found by similarity searching of spectrum 40 with the database of HSQC spectra.

25, 33, 31, and 14, which are shown in Figure 4. Spectrum 14 was removed from this set after applying the scaling factor to correct for size differences. Both methods found the exact match to itself and found spectra 37 and 25. The structures of these particular 9334

Analytical Chemistry, Vol. 81, No. 22, November 15, 2009

We have developed a method of similarity matching of HSQC spectra of pure compounds, which can be applied to database searching. The protocol has several steps. At the center of this process is first use the discrete self-adaptive differential evolution algorithm to uniquely match the peaks from one HSQC spectrum to another. The initial set of spectra are obtained by selection of spectra that have an average matched distance per peak less than a set user defined value, in our case e0.05. The selected set of spectra at this stage can be used to detect common subsets or substructures with the spectra or duplicate spectra. The similarity criteria can be further refined by scaling the average matched

distance per peak by an exponential correction, which is determined by the difference in the number of peaks between the two spectra. The protocol can also detect spectra that are identical but have been referenced incorrectly. The peak to peak matching protocol described is a promising method to find similar spectra in a database given a reference spectrum. We can envisage this type of similarity protocol to be primarily used by a natural product chemist to perform dereplication to find already identified compounds as well as potentially identifying related compounds (where simple modification such as alkylation, etc. naturally occur). In addition, this approach can be applied to aid in structure elucidation and to speed up the identification of unknown samples, be they products of synthetic chemistry or byproducts thereof.

ACKNOWLEDGMENT The research was supported with assistance from the State Government of Queensland through the Queensland NMR Network. Dr. Mobli acknowledges the ARC for funding support (Grant Code: DP0774245). SUPPORTING INFORMATION AVAILABLE Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org. Received for review April 29, 2009. Accepted September 22, 2009. AC901616T

Analytical Chemistry, Vol. 81, No. 22, November 15, 2009

9335