Anal. Chem. 1988, 60, 1886-1895


phase-nulled experiment decreased markedly in the absence of rhodamine 6G. The noise in the phase-nulled Raman spectrum of Figure 2 was therefore due to noise on the fluorescence. Possible sources of noise on the fluorescence are laser intensity fluctuations, shot noise on the fluorescence, and phase jitter. The 15-MHz noise on the fluorescence would have to be on the order of 0.2% at 550 nm to account for the noise in the Raman spectrum. High-frequency laser fluctuations were reduced well below 0.2% in this experiment by careful mode locking. The shot noise on the fluorescence at 550 nm was an order of magnitude less than 0.2%, as calculated from the 4-μA average current and 10^6 gain. Phase jitter, which arises from asynchronous shifts between the photomultiplier response and the reference photodiode response, appears to be the origin of the noise in the phase-nulled Raman spectrum. Factors such as drift in the high-voltage supply of the photomultiplier or spatial instabilities of the reference beam cause phase changes. A phase jitter of 0.2° (2σ) for a 1-s RC time constant would account for the noise in the data. This level of phase jitter is modest for a 329-MHz experiment, being only a factor of 2 larger than the jitter we observe between the outputs of a pair of photodiodes.

CONCLUSIONS

Very high enhancements of Raman signals over fluorescence signals are predicted and achieved by phase nulling. By use of a conventional photomultiplier having a 1-ns risetime, the enhancement is higher than that reported for a time-resolved experiment using a microchannel-plate photomultiplier gated with a 30-ps aperture time. Operation at higher frequencies using a microchannel-plate photomultiplier would improve the performance of the phase-nulling experiment by strongly demodulating the fluorescence to reduce the effect of phase jitter. By operation at 3 GHz rather than 300 MHz, for example, the ac component of the fluorescence would be reduced by an order of magnitude. High-frequency operation would also relax the need for single-component fluorescence decays, as all fluorescence decays slower than 3 ns are phase shifted to 89° or more at a modulation frequency of 3 GHz. To its disadvantage, phase nulling requires high Raman levels to overcome the fluorescence noise. Time-resolved spectroscopy, by contrast, has lower fluorescence rejection, but rejects the fluorescence background in proportion to the fluorescence itself. Phase nulling may find unique suitability in the area of resonance Raman detection of weak fluorophors, where high Raman signals and large fluorescence backgrounds are inherent.
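The two numerical claims in the conclusions (a phase shift near 89° at 3 GHz and roughly a tenfold drop in the ac component) can be checked with a short calculation. The sketch below assumes a single-exponential fluorescence decay and the standard phase-modulation relations tan(phi) = omega*tau and m = 1/sqrt(1 + (omega*tau)^2); these relations and the function names are assumptions introduced here for illustration and are not stated in the text.

```python
import math


def phase_shift_deg(freq_hz, lifetime_s):
    """Phase lag in degrees, assuming tan(phi) = omega * tau."""
    return math.degrees(math.atan(2 * math.pi * freq_hz * lifetime_s))


def demodulation(freq_hz, lifetime_s):
    """Relative ac amplitude, assuming m = 1 / sqrt(1 + (omega * tau)^2)."""
    omega_tau = 2 * math.pi * freq_hz * lifetime_s
    return 1.0 / math.sqrt(1.0 + omega_tau ** 2)


tau = 3e-9  # a 3-ns decay, the limiting case quoted in the text
phi = phase_shift_deg(3e9, tau)
ratio = demodulation(300e6, tau) / demodulation(3e9, tau)

print(f"phase shift at 3 GHz: {phi:.2f} degrees")
print(f"ac component at 300 MHz relative to 3 GHz: {ratio:.1f}x")
```

Under these assumptions the 3-ns case lands just below 89°, and longer lifetimes move closer to 90°, which is consistent with the rounded figure quoted above.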


RECEIVED for review December 28, 1987. Accepted May 6, 1988. This work was supported by the National Science Foundation under Grant CHE-8713911.

Application of Visual Spectral Matching Techniques to Automated Carbon-13 Nuclear Magnetic Resonance Library Searching

Scott E. Carpenter and Gary W. Small*

Department of Chemistry, The University of Iowa, Iowa City, Iowa 52242

Library searching procedures for 13C nuclear magnetic resonance spectra are described based on the implementation of visual spectral matching techniques. A mapping algorithm is used to match spectral lines despite differences in chemical shifts due to solvent effects and experimental artifacts. In addition, reverse search capabilities are developed to aid the investigator in analyzing mixture spectra, as well as complex spectra not contained in the library. A discussion of the unique, interpretive features of the search is provided, and results are presented to illustrate the performance of the search.

Carbon-13 nuclear magnetic resonance spectroscopy (13C NMR) is an important tool for solving structure-elucidation

problems. The interpretation of 13C NMR spectra can often be very time-consuming, however. In many instances, the chemist must resort to manual comparisons of the spectrum of an unknown to spectra of known compounds he or she feels may be similar. Computer-aided library search procedures represent one way of automating this spectral comparison process. In a library search, numerical procedures are used to compare an unknown spectrum to each member of a spectral database. Those spectra judged most similar to the unknown are reported as candidate structures. Several library searches have been described for 13C NMR data. Binary (peak/no peak) searches have been used by Zupan et al. (1) and Uthman et al. (2). Clerc and co-workers have made use of a binary representation or “signature” in an attempt to represent underlying chemical structure in

ANALYTICAL CHEMISTRY, VOL. 60, NO. 18, SEPTEMBER 15, 1988

addition to shift information (3). Mlynarik et al. introduced a search that appears to use full spectral data (4). The search algorithm incorporated a tolerance value that could be manually adjusted, thereby improving direct shift matching by accounting for movement of shifts due to structural, experimental, or solvent effects. As part of a combined search system, Zippel et al. (5) have used full spectral data and an algorithm that includes an adjustable shift tolerance, tolerance “windows” for the spectral matching procedure, and calculation of a matching factor that considers the relationship between the unknown spectrum and a library spectrum based on the number of matching signals. This paper presents the development of a new library search for 13C NMR spectra that simulates the visual mapping procedure a chemist uses when comparing two spectra. An “interpretive” algorithm is implemented that can recognize special-case matching situations and vary its shift-matching procedure accordingly. When matching between the unknown and a library spectrum is completed, a scoring procedure is used to determine the degree to which the overall pattern integrity of the unknown has been preserved. This work also features reverse search capabilities to aid in identification of mixture components and sample impurities. The developed search procedure is compatible with a typical laboratory microcomputer.

EXPERIMENTAL SECTION

The computer software for this research was written in FORTRAN 77 and implemented initially on a PRIME 9955 interactive computer system operating in the Gerard P. Weeg Computing Center at the University of Iowa. By use of the Microsoft FORTRAN Optimizing Compiler, Version 4.01 (Microsoft Corp., Redmond, WA), a version of the library search has also been implemented on a 6-MHz IBM PC-AT with a 20-Mbyte hard disk drive and 80287 math coprocessor. The microcomputer version of the program requires 212 kbytes of user memory. Several utility subroutines from the ADAPT software package (6) were used. Modified versions of subroutines described by Hartigan (7) were used to implement the Fisher clustering algorithm. Spectral plots were generated with original software. A Hewlett-Packard 7475A digital plotter was used as the output device.

The 13C NMR spectral library used for this research consisted of 7198 spectra from the NIH-EPA 13C NMR database (Fein-Marquart Associates, Baltimore, MD). Storage requirements for the library as implemented by us on the IBM PC-AT are approximately 385 kbytes of disk storage and 120 kbytes of additional memory used as a RAM disk during the search. An additional subset of 67 spectra of various monosaccharides was added to the library to evaluate the reverse search of an anomeric mixture spectrum of D-altrose. Similarly, a set of 46 disaccharide spectra was added to the library to evaluate the reverse search of an amygdalin spectrum. Spectra in these two library subsets were taken from the literature, as well as from data collected locally. The three target spectra used in the reverse search examples were also collected locally. All spectra used in the search were referenced to either sodium 2,2-dimethyl-2-silapentane-5-sulfonate (DSS) or (Me)4Si. The DSS and (Me)4Si reference lines are coincident.

For the local data collection, compounds were dissolved in D2O (approximately 1.0 M) except for the aminobenzoic acid mixtures, which were dissolved in CDCl3. Broad-band decoupled spectra were recorded at 25 °C and 90.56 MHz on a Bruker WM-360 superconducting magnet NMR spectrometer operating in the University of Iowa High-Field NMR Facility. A 5-mm C/H probe was used, and the free induction decay size was 32K. Chemical shifts were measured relative to internal DSS or to (Me)4Si.

RESULTS AND DISCUSSION

Overview of Search Design and Methodology. The spectral matching process used in library searching can be viewed as a problem in pattern recognition (8). The 13C NMR spectrum of an unknown is a set of lines defining a distinct visual pattern that can be used to identify and/or classify the

[Figure 1 flow chart, boxes in order: Input Unknown Spectrum; Clustering/Partitioning; Input a Library Spectrum for Comparison; Mapping Algorithm; Select Best Matches Based on Match Quality; Scoring; decision: All Partitions Compared?]

Figure 1. Flow chart illustrating search design.

unknown compound. In library searching, the identification process involves comparison of an unknown pattern to a set of known patterns. Arguably, the best pattern recognition algorithm for performing such comparisons is that implemented by the eyes and thought processes of a chemist. The focus of this research was to develop an automated library searching algorithm that encodes the decision-making processes involved in visual 13C NMR spectral matching.

A flow chart illustrating the overall search design is given in Figure 1. A clustering algorithm is employed to separate the spectral lines of an unknown spectrum into subsets or clusters based on the number of lines and their relative proximity to one another. The clustering results are used to divide the unknown spectrum into partitions. If a library spectrum is indeed similar or equivalent to the unknown spectrum, it should have the same number of clusters with similar or identical partition boundaries. Therefore, before each comparison, each library spectrum is divided into the same number of partitions with exactly the same partition boundaries as the unknown spectrum.

The next step of the search is to compare the lines in each partition of the unknown spectrum to the lines in the corresponding partition of each library spectrum. Within each partition, a match tolerance, or slide factor, is computed. The slide factor is the maximum distance that two lines can be separated in order to be still classified as matching. The slide factor enables each match to be assigned a quality index indicating an exact match, a “good” match, or a mismatch. The line mapping section of the search algorithm provides information about all possible line matches within the corresponding unknown and library partitions. Every line of one is compared to every line contained in the other and is stored in a matrix format representing all possible match combinations.
The last phase of the search algorithm involves scoring each of the selected best matches. Match penalties are assigned to “good” matches and mismatches, in contrast to exact matches, which are assigned a penalty of zero. Missing and extraneous lines also acquire a penalty. Each comparison between corresponding partitions is scored separately. The

Table I. Analysis of 13C NMR Spectrum of 1-Hexyne

Best Clustering Arrangements Given by Fisher Algorithm

no. of clusters   sum of squares   % of total
1                 4364.4921          0.00
2                  294.0452         93.26
3                  159.3987         96.35
4                   35.3000         99.19
5                    7.2200         99.83
6                    0.0000        100.00

Optimum Clustering Using the 85% Selection Criterion
Two Clusters with Sum of Squares = 294.0452

cluster   no. of obs   mean      std dev
1         4            21.0525   8.3127
2         2            76.3050   8.2050

Partition Boundary Values

partition   lower boundary   upper boundary
1           -100.00           47.24
2             47.25          250.00

scores for each partition comparison are summed, resulting in a total score for the overall spectrum. The degree of similarity is indicated by the magnitude of the score value. A score of zero indicates an exact spectral match. Increasing magnitude in the score value indicates increasing spectral dissimilarity. Upon completion of a search, the investigator is provided with a list of best matches. Visual spectral comparisons can then be performed to evaluate the search results.

Clustering. Clustering methods are numerical procedures for grouping similar objects (7-9). The selection of an appropriate clustering algorithm depends on the type of data being analyzed. For our purposes, the algorithm developed by W. D. Fisher (7) was found to be particularly well-suited to ordered one-dimensional data. A set of chemical shifts comprising a spectrum can be regarded as a set of time-ordered data points. The clusters of such data are constrained to be consecutive and nonoverlapping, which provides additional simplicity to the algorithm.

The 13C NMR spectrum of 1-hexyne contains six chemical shifts (84.5, 68.1, 30.7, 21.9, 18.1, 13.5 ppm) and serves as a useful example in explaining the principles of the Fisher algorithm. For this spectrum, the maximum number of possible clusters is equal to six, where each cluster contains one spectral line. However, this arrangement offers no information concerning the degree of similarity between neighboring lines. The lines may also be grouped into five clusters. There are five possible ways to group six objects into five clusters. Each possible five-cluster arrangement is evaluated by use of an error function. That arrangement producing a minimum value of the error function is taken as the best. The error function employed is the summation of the sums of squares calculated for each cluster. The sum of squares for a given cluster is simply the sum of the squares of the deviations of the lines within the cluster from the cluster mean.
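The error function and the search over consecutive groupings can be illustrated with a small sketch. It enumerates every way of cutting the ordered shifts into k consecutive clusters (a brute-force stand-in, not the Fisher procedure as implemented by the authors), reports the percent reduction in the sum of squares in the manner of Table I, and applies the 85% selection criterion named there. The function names are hypothetical, and the shifts are the rounded values quoted in the text, so the sums of squares agree with Table I only to within rounding.

```python
from itertools import combinations

# Rounded 1-hexyne shifts (ppm) as quoted in the text.
HEXYNE_SHIFTS = [84.5, 68.1, 30.7, 21.9, 18.1, 13.5]


def sum_of_squares(cluster):
    """Sum of squared deviations of the lines from the cluster mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)


def best_arrangement(shifts, k):
    """Return (error, clusters) for the best split into k consecutive clusters.

    Exhaustive enumeration of cut positions; adequate for small spectra only.
    """
    lines = sorted(shifts)
    n = len(lines)
    best = None
    for cuts in combinations(range(1, n), k - 1):
        edges = (0, *cuts, n)
        clusters = [lines[a:b] for a, b in zip(edges, edges[1:])]
        error = sum(sum_of_squares(c) for c in clusters)
        if best is None or error < best[0]:
            best = (error, clusters)
    return best


def select_clustering(shifts, threshold=85.0):
    """First arrangement whose percent reduction reaches the threshold."""
    total, _ = best_arrangement(shifts, 1)
    for k in range(1, len(shifts) + 1):
        error, clusters = best_arrangement(shifts, k)
        reduction = 100.0 * (1.0 - error / total) if total > 0 else 100.0
        if reduction >= threshold:
            return k, error, reduction, clusters


k, error, reduction, clusters = select_clustering(HEXYNE_SHIFTS)
print(k, round(error, 2), round(reduction, 2), clusters)
```

For the rounded 1-hexyne shifts this selects the two-cluster arrangement (four upfield lines, two downfield lines), in line with the middle section of Table I.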
Analogously, all groupings of the six lines into four clusters, three clusters, and so on must be investigated. Upon completion, the Fisher algorithm provides a list of the best arrangements of the six spectral lines into one through six clusters. To compare each clustering arrangement, a value representing the percent reduction of the total sum of squares can be computed by using the standard percent-of-total calculation and the sum of squares of the one-cluster arrangement as the total sum of squares value. A list of results for the example of 1-hexyne is presented in the first section of Table I. For many example spectra, it was experimentally found that the

first encountered clustering arrangement representing a reduction in the sum of squares of approximately 85% or greater was the one that provided a result similar to that obtained by visual grouping of the spectral lines. Inspection of the 1-hexyne chemical shifts reveals there are two clusters of spectral lines. The upfield cluster contains four lines corresponding to four sp3-hybrid 13C atoms, and the downfield cluster contains two lines corresponding to the two sp-hybrid 13C atoms. As shown in the middle of Table I, the procedure described above produces the same clustering arrangement.

Partitioning. Only the unknown spectrum undergoes cluster analysis. These clusters are then used to form partitions. A partition is a spectral region with assigned upper and lower boundaries that contains one cluster. Partitions are adjacent and consecutively cover the nominal 13C NMR spectral range of -100 to 250 ppm. Partitioning the spectrum provides two advantages. First, it is possible for two spectra to be similar within certain regions of the spectrum and yet dissimilar in others. However, similarity within a region indicates a common substructural unit. Thus, partitioning enables the search to extract such information. Second, the position of a given chemical shift in a certain spectral region may tend to vary more than in other regions according to chemical environment. Therefore, it is advantageous to have a different tolerance value in each region for determining matches or mismatches between spectral lines. Partitioning allows this tolerance factor to be tailored based on spectral information within the two corresponding partitions. Setting the partition boundaries first involves calculating the mean and standard deviation of each cluster. A distance equal to one standard deviation is subtracted from the first shift of the cluster and added to the last shift of the cluster, forming an initial set of boundaries.
Since it is desirable to have one boundary between two adjacent clusters, overlaps or spaces between the initial boundaries of the two are resolved by assigning the shared boundary to be the average of the two initial boundary values. The lower boundary of the first partition is defaulted to be -100 ppm, and the upper boundary of the last partition is defaulted to 250 ppm. A one-line spectrum is defined to be contained in one partition covering the -100 to 250 ppm range. In order to prevent the inadvertent analysis of a spectral line more than once, the boundary values between adjacent partitions must be distinct instead of actually shared. Therefore, the shared boundary value is assigned to be the upper boundary of the lower adjacent partition. An offset (0.01 ppm) is added to the shared boundary value, producing the lower boundary of the upper adjacent partition. Calculated partition boundary values for the 1-hexyne example are displayed in the last section of Table I.

Slide Factor and Match Quality Index. The quality of the match between two spectral lines is judged by comparing the distance between the lines to some value defining an acceptable match. This value is the partition slide factor and is determined based on the characteristics of a particular partition of the unknown spectrum, as well as the characteristics of the corresponding partition in a given library spectrum. Each partition has a separate slide factor that is continually adjusted before each library comparison. The slide factor calculation considers two factors: (1) the number of spectral lines within each of the two corresponding partitions; (2) the diameter of the cluster within each partition. The slide factor is calculated by using eq 1.

\[
\text{slide} = \frac{\text{largest cluster diameter (ppm)}}{\text{least number of lines}} \tag{1}
\]
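The boundary-setting steps and eq 1 can be sketched as follows. Several details are assumptions made here and not statements from the text: the standard deviation is taken as the population form (`statistics.pstdev`), the cluster diameter is taken as the distance between the outermost lines of a partition, boundaries are rounded to 0.01 ppm, and both partitions passed to `slide_factor` are non-empty. Because of these assumptions and the rounded shifts, the boundaries produced are not expected to reproduce the tabulated values in Table I exactly.

```python
from statistics import pstdev

LOWER_LIMIT = -100.0  # ppm, nominal lower end of the spectral range
UPPER_LIMIT = 250.0   # ppm, nominal upper end of the spectral range
OFFSET = 0.01         # ppm, keeps adjacent partitions from sharing a value


def partition_boundaries(clusters):
    """Return (lower, upper) bounds for clusters ordered upfield to downfield.

    Each cluster is a sorted list of shifts. pstdev is an assumed convention.
    """
    initial = []
    for cluster in clusters:
        sd = pstdev(cluster)  # 0.0 for a one-line cluster
        initial.append((cluster[0] - sd, cluster[-1] + sd))

    bounds = []
    lower = LOWER_LIMIT
    for i in range(len(clusters)):
        if i + 1 < len(clusters):
            # Shared boundary: average of the two facing initial boundaries.
            upper = round((initial[i][1] + initial[i + 1][0]) / 2, 2)
        else:
            upper = UPPER_LIMIT
        bounds.append((lower, upper))
        lower = round(upper + OFFSET, 2)
    return bounds


def slide_factor(unknown_lines, library_lines):
    """Eq 1: largest cluster diameter divided by the least number of lines."""
    def diameter(lines):
        return max(lines) - min(lines)  # assumed meaning of "diameter"

    largest = max(diameter(unknown_lines), diameter(library_lines))
    fewest = min(len(unknown_lines), len(library_lines))
    return largest / fewest


bounds = partition_boundaries([[13.5, 18.1, 21.9, 30.7], [68.1, 84.5]])
slide = slide_factor([13.5, 18.1, 21.9, 30.7], [14.0, 22.0, 31.0])
print(bounds, round(slide, 2))
```

The library partition passed to `slide_factor` in the usage line is an invented three-line example, included only to show that the divisor is the smaller of the two line counts.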

In the context of the overall search procedure, the slide factor is used to identify outlying spectral shifts that are


[Figure 2 panel label: Matrix Row Sort]

Figure 2. Illustration of matrix construction and matrix manipulation used to obtain best matches. Circles represent the position of row column pointers used to indicate best matches with each row peak.

Figure 3. Leftmost portion of an 11 by 14 sorted matrix containing several multiple match cases. Circles represent the position of row column pointers that indicate selected best matches.

obvious mismatches with any of the shifts in the partition being compared. Therefore, the value of the slide factor should not be extremely small or excessively large. The derived formula for calculating the slide factor seems to yield an acceptable intermediate value. As all possible match combinations are recorded during the mapping procedure, a quality index is assigned to each based on the slide factor. This quality index is used later by the scoring section of the search. If the distance between two compared spectral lines is less than or equal to 0.10 ppm, the match is assigned a quality index of 1, indicating an exact match. The ±0.10 ppm tolerance was incorporated to account for slight instrumental variations. If the distance between two compared spectral lines is greater than 0.10 ppm but less than or equal to the slide factor, the match is given a quality index of 2, indicating a "good" match. If the distance between the two spectral lines being examined is greater than the slide factor, the match has a quality index of 3, indicating a mismatch.

Line Mapping Algorithm. As mentioned previously, each spectral line in one partition is matched to every spectral line in the corresponding partition. A record of each match and information about the particular match are stored as an element in a matrix containing information about all possible match combinations. The format in which the matrix is constructed depends upon the number of spectral lines contained in each of the two partitions being compared. The maximum number of matches that can be made is given by the number of peaks in the partition containing the least number of peaks. Each peak in this partition will represent a row in the matrix. The partition with the least number of spectral lines is always mapped onto the partition with the greater number of spectral lines. Each peak in the latter will represent a column in the match matrix.
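The three-level quality index defined above can be sketched in a few lines. The function name is hypothetical, and rounding the distance to 0.01 ppm before comparison is an assumption added here so that floating-point error does not push a 0.10 ppm separation over the exact-match tolerance.

```python
EXACT_TOLERANCE = 0.10  # ppm, the exact-match tolerance given in the text


def quality_index(shift_a, shift_b, slide):
    """Return 1 (exact match), 2 ("good" match), or 3 (mismatch)."""
    # Rounding to 0.01 ppm is an assumption, not part of the description.
    distance = round(abs(shift_a - shift_b), 2)
    if distance <= EXACT_TOLERANCE:
        return 1
    if distance <= slide:
        return 2
    return 3


print(quality_index(132.7, 132.8, 5.0))  # lines 0.10 ppm apart
print(quality_index(21.9, 23.4, 5.0))    # lines 1.5 ppm apart
print(quality_index(21.9, 30.7, 5.0))    # lines 8.8 ppm apart
```

The exact-match test is applied first, so a separation of 0.10 ppm or less is indexed as 1 even if the slide factor for the partition happens to be smaller than the tolerance.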
In the event that both partitions contain the same number of shifts, the partition with the largest cluster diameter is mapped onto the corresponding partition with a smaller cluster diameter. If there are no shifts present in either partition or in both partitions, no mapping is performed, and the scoring section is used to resolve the problem appropriately. The first upfield shift in the partition containing the least number of lines is consecutively matched, upfield to downfield, with each shift in the corresponding partition. Each match forms an element in row one of the matrix. Each element contains three numbers, termed row element indices. These are (1) match quality, represented by a 1, 2, or 3, (2) the difference (ppm units) separating the two lines, and (3) the column number identifying the match. After all possible match combinations have been examined and recorded, each row of the matrix is sorted from least to greatest distance, based on the second index of each row element. An example matrix, along with the sorted version of the matrix, is shown in Figure 2. This is an actual matrix for the second partition (70.13-250.00 ppm) of 1-phenyl-1-propanone mapped onto a higher resolution spectrum of 1-phenyl-1-propanone. These

two spectra are discussed later in the evaluation of search results.

Selection of Best Matches. A column pointer is associated with each row of the matrix and the position of the pointer indicates the best match for a given row peak. After the matrix is sorted from least to greatest distance, the element in the first column of each row ideally represents the best match for a given row peak. Therefore, the column pointers are all initialized to be in column 1. The positions of the pointers are indicated by circles in the sorted matrix of Figure 2. By inspection, it is clear that the peak corresponding to row 1 was matched with peak 1 in the compared partition, row peak 2 with column peak 3, row peak 3 with column peak 4, and so on. The peak corresponding to column 2 was not matched. The presence of this extraneous peak indicates some degree of dissimilarity between the two patterns of spectral lines being compared. The algorithm finds and identifies this case by keeping a record of the column peaks that have not been matched. A penalty for each extraneous peak is assigned later in the scoring section.

In most cases, the selection of best matches is much more complex. In many situations, more than one row peak will ideally match best with the same column peak, resulting in a situation that will be termed a multiple match case. In such cases, the algorithm proceeds by examining each column peak and the number of row peaks that have been matched to it. If only one peak has been matched with a given column peak, that is the best and only match possible, and the column pointer for the row remains unchanged. In a multiple match case, the algorithm scans the multiple matching rows in search of some pattern based on the third index of each row element. This pattern scan is more easily explained by stepping through the scanning procedure for an actual sorted matrix given in Figure 3.
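The matrix construction, row sort, and initial pointer placement described so far can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, the example shifts are invented for illustration and are not taken from the spectra of Figure 2 or Figure 3, distances are rounded to 0.01 ppm, the cluster diameter is taken as the span between the outermost lines, and both partitions are non-empty. The sketch only detects extraneous columns and multiple match cases from the initial pointers; it does not carry out the pattern scan that resolves them.

```python
def build_sorted_matrix(part_a, part_b, exact=0.10):
    """Return (rows, cols, matrix); each row is sorted by distance.

    Row elements are (quality, distance, column number), with columns
    numbered from 1 upfield to downfield.
    """
    def diameter(lines):
        return max(lines) - min(lines)  # assumed meaning of "diameter"

    a, b = sorted(part_a), sorted(part_b)
    # Rows come from the partition with fewer lines; on a tie, from the
    # partition with the larger diameter.
    if (len(a), -diameter(a)) <= (len(b), -diameter(b)):
        rows, cols = a, b
    else:
        rows, cols = b, a
    slide = max(diameter(a), diameter(b)) / len(rows)  # eq 1

    matrix = []
    for row_shift in rows:
        elements = []
        for j, col_shift in enumerate(cols, start=1):
            d = round(abs(row_shift - col_shift), 2)
            q = 1 if d <= exact else 2 if d <= slide else 3
            elements.append((q, d, j))
        elements.sort(key=lambda e: e[1])  # least to greatest distance
        matrix.append(elements)
    return rows, cols, matrix


def initial_matches(matrix, n_cols):
    """With every pointer on the first element, report which rows claim
    each column, the unclaimed columns, and the multiple match cases."""
    claimed = {}
    for i, row in enumerate(matrix, start=1):
        claimed.setdefault(row[0][2], []).append(i)
    extraneous = [j for j in range(1, n_cols + 1) if j not in claimed]
    multiple = {c: r for c, r in claimed.items() if len(r) > 1}
    return claimed, extraneous, multiple


# Invented shifts (ppm), for illustration only.
rows, cols, matrix = build_sorted_matrix(
    [127.9, 132.7, 137.0, 200.1],
    [128.0, 128.5, 132.8, 136.9, 200.5],
)
claimed, extraneous, multiple = initial_matches(matrix, len(cols))
print(claimed, extraneous, multiple)
```

In this invented example each row claims a different column and column 2 is left unclaimed, the same kind of outcome the text describes for Figure 2; a multiple match case would show up as a column listed against two or more rows.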
The matrix in Figure 3 represents the mapping of the second partition (31.62-94.06 ppm) of 22(R)-aminocholest-5-en-3β-ol onto the corresponding partition of cholest-5-en-3β-ol. For convenience, only the first six columns of the sorted matrix are shown. Only one possible match exists for the peak corresponding to column 1. Column peaks 2, 3, and 4 have not been indicated as matches with any row peaks, and therefore, column peak 5 is the next one investigated. Row peaks 2 and 3 both match best with column peak 5, indicating a multiple match case. One possible pattern would be indicated if the second element of row 3 had a third index of 6, indicating that row peaks 2 and 3 could be matched with column peaks 5 and 6, respectively, if row peak 3 were to be assigned its next best match. This pattern would correspond to two spectral shifts that have been moved slightly upfield or downfield due to some structural, solvent, or experimental artifact with respect to the two lines in the corresponding partition that have not been moved. Finding no such pattern, the algorithm proceeds to investigate another pattern indicative of the same situation just described. This pattern would be indicated if the second element of row 2 had a third index of 4, indicating a match between row peaks 2 and 3 and column peaks 4 and 5, respectively. This case is


detected, and row peak 3 is therefore selected as the best match with column peak 5, and the column pointer of row 2 is incremented, indicating row peak 2 as the best match with column peak 4. If no patterns were detected, row peak 2 would have been selected as the best match with column peak 5 since it is the closest based on distance (smaller second index), and the column pointer for row peak 3 would have been incremented, thereby resulting in one distinct match with column peak 5.

Note that multiple matches also occur with column peak 9. In this case, a 9-10 pattern is found, resulting in matches of row peaks 6 and 7 with column peaks 9 and 10, respectively. Another multiple match occurs between row peaks 8 and 9, both indicating a match with column peak 11. Again, the same pattern scanning procedure is employed to find an 11-12 pattern with row peaks 8 and 9. An additional multiple match now exists between row peaks 9 and 10 to column peak 12. The position of the row 9 column pointer in regard to the scanning procedure is unimportant. Just as before, the pattern scan proceeds by checking the third index of the next adjacent element of the second multiple match. The row 10 column pointer is initially set at column 1. Thus, row 10, element 2, index 3 is the first location investigated by the scanning procedure. A 12-13 pattern is detected. The row 9 column pointer therefore remains unchanged at the second element, indicating column peak 12 as the best match with row peak 9. The row 10 column pointer is incremented, resulting in a match between column peak 13 and row peak 10. After all row matches for each column peak have been examined, it is clear that column peaks 2, 3, and 8 are extraneous peaks. The matching algorithm thus keeps a record of the number and position of each extraneous peak and passes this information to the scoring procedure for evaluation.

Scoring Procedure.
The scoring procedure is divided into two sections: (1) the match penalty section, which assigns an appropriate penalty to “good” matches and mismatches; (2) the structure penalty section, which assigns a penalty to extraneous peaks identified by the mapping algorithm. Before either of these sections is executed, the scoring procedure first investigates cases where one or both partitions contain no peaks. The partitioning algorithm for a normal search will never assign an empty partition. The case of two blank partitions can only occur when the automatic clustering and partitioning are overridden by the user and a blank partition is manually assigned within the unknown spectrum. If both partitions contain no peaks, the two partitions are identical and no match penalty or structure penalty is acquired. However, if only one partition is empty, a major difference in overall structure is indicated. The lines in the nonempty partition are counted and multiplied by a weighted penalty factor. The penalty weighting factor becomes larger as the number of lines contained in the nonempty partition increases. The number of lines times the appropriate weighting factor yields a final score for the partition. The normal match penalty and structure penalty sections are not executed for this special case.

As outlined, the position of the column pointer for each row indicates the best match for each row peak. Scores for each match are determined by examining the distance separating the two matched spectral lines. This distance is the second index of the row element at which the column pointer is positioned. If this distance is less than or equal to 0.10 ppm, also indicated by a first index of 1, the match is considered to be exact, and no penalty points are assigned. If the match is a “good” match or a mismatch, the penalty is equal to the distance between the lines. The matches for each row peak are evaluated in this manner, resulting in an overall summation of match penalty points.

Extraneous peaks are handled by the structure penalty section of the scoring procedure. The penalty for each extraneous peak is equal to the distance to the closest peak in the corresponding partition. The penalties for each extraneous peak are summed, resulting in an overall structure penalty score. This procedure minimizes the penalty in the case of an extraneous peak that results from differences in spectral resolution. The match penalty score and the structure penalty score are then added to produce a score for the partition. Each partition in the unknown is evaluated by using the same procedures as outlined above. The scores for each partition are summed to yield a total score for the overall spectral comparison, and these scores are sorted from smallest to largest in a continuously updated “hit list”.

Table II. Search Results

Search 1
Target: 1-Phenyl-1-propanone
Number of Spectra Compared: 7198

hit   name                                      score
1     1-phenyl-1-propanone                       0.000
2     1-phenyl-1-propanone                       9.630
3     1-(2,5-dimethylphenyl)ethanone            19.700
4     2-methyl-1-phenyl-1-propanone             21.300
5     1-(2-methylphenyl)ethanone                21.500

Search 2
Target: 4-Chloro-1-(phenylmethyl)-1H-pyrazole
Number of Spectra Compared: 7198

hit   name                                      score
1     4-chloro-1-(phenylmethyl)-1H-pyrazole      0.000
2     1-(phenylmethyl)-1H-pyrazole              11.000
3     3-chloro-1-(phenylmethyl)-1H-pyrazole     12.000
4     5-chloro-1-(phenylmethyl)-1H-pyrazole     14.200
5     (dimethoxymethyl)benzene                  18.000

Search 3
Target: Cholest-5-en-3β-ol
Number of Spectra Compared: 7198

hit   name                                      score
1     cholest-5-en-3β-ol                         0.000
2     22(R)-aminocholest-5-en-3β-ol             27.080
3     22(S)-aminocholest-5-en-3β-ol             33.620
4     cholesta-5,7-dien-3β-ol                   35.130
5     cholesta-3,5-diene                        39.320

Evaluation of Search Results. The goal in evaluating search results is to determine the degree to which the structural integrity of the unknown is preserved and identified. In each of the results presented, a library spectrum was selected as the “unknown”, thus permitting better evaluation of the resulting hit list of best matches.

Search 1. The spectrum of 1-phenyl-1-propanone was selected as the target spectrum. The resulting hit list of the five best matches is presented in Table II, and the corresponding spectra and structures for each compound are given in Figure 4. Hits 1 and 2 are both spectra of 1-phenyl-1-propanone recorded under different experimental and solvent conditions. The first spectrum represents a neat sample, while the second spectrum was recorded in CDCl3. The two spectra were collected on different instruments. Despite the differences in shift position and number of shifts, the search still accurately identifies the target compound. The remainder of compounds on the hit list are all 1-phenyl-1-ketones, correctly identifying the appropriate class of the target compound.

Search 2.
The spectrum of 4-chloro-l-(phenylmethyl)1H-pyrazole was selected as the target spectrum. The hit list of best matches is presented in Table 11, and the spectrum and structure of each compound are illustrated in Figure 5. The second best match for the spectrum of the target compound is the spectrum of 1-(phenylmethy1)-1H-pyrazole,the

ANALYTICAL CHEMISTRY, VOL. 60, NO. 18, SEPTEMBER 15, 1988

Figure 4. Broad-band decoupled 13C NMR spectra and structures of compounds on the hit list for search 1. [Recoverable panel label: 1-(2,5-dimethylphenyl)ethanone.]

same compound minus chlorine substitution in the fourth position. The third and fourth best matches correctly identify the compound class and also identify the presence of chlorine substitution. Hits 3 and 4 are the only other occurrences of chloro-substituted 1-(phenylmethyl)-1H-pyrazoles in the library. The fifth best match is useful in identifying the presence of a phenylmethyl structure in which the methyl 13C atom is greatly deshielded. Even though hit 5 does not identify the compound class, it does provide clues about the structure of the target compound.

Search 3. In order to test the search algorithm more rigorously, a target compound of considerably greater complexity was investigated. The 27-line spectrum of the steroid cholest-5-en-3β-ol, commonly named cholesterol, was used as

[Figure 5. Spectra and structures of compounds on the hit list for search 2, plotted against a chemical-shift axis. Recoverable panel labels: 1-(phenylmethyl)-1H-pyrazole; 3-chloro-1-(phenylmethyl)-1H-pyrazole.]

the target spectrum. The search results are also presented in Table II, and the spectra and structures of all compounds on the hit list are given in Figure 6. Hits 2 and 3 correctly identify the main structure and class of the target compound and differ only by an amino group in the 22-position. The fourth best match accurately preserves information about the 3β-conformation of the hydroxyl group as well as the presence


of a double bond at the fifth position and differs only by an extra double bond at the seventh position. The last match, cholesta-5,7-diene, does not reflect the appropriate compound class; however, it does correctly isolate the main backbone of the steroidal structure of the target compound.

Reverse Search. Often an unknown spectrum is not contained in the library, and in some cases, the library does

not contain spectra of compounds that are similar to the unknown. In this situation, usually no helpful information can be gained. Similarly, a spectrum that actually represents an isomeric mixture or that contains spectral components due to impurities is unlikely to match any library spectrum, even if the appropriate compound class is well-represented in the library.

Figure 6. Broad-band decoupled 13C NMR spectra and structures of compounds on the hit list for search 3.

In such cases, a reverse search strategy is useful. First developed for mass spectral library searches (10), a reverse search looks for spectra that are subsets of the unknown spectrum. These subset spectra can then be used to identify specific substructural units of a complex molecule whose spectrum is not contained in the library, or the subset spectra may be used to identify mixture components and to analyze


samples containing impurities. Laude and Wilkins have applied these concepts previously with good success to 13C NMR library searching of mixture data (11, 12). In our approach, the procedural details of the reverse search differ only slightly from those of the normal forward search. The methodology used for slide factor calculation, line mapping, matrix construction, and selection of best matches remains exactly the same. The only two modifications are in the partitioning and scoring sections.

Reverse Search Partitioning. Cluster analysis using the Fisher algorithm is performed on the unknown spectrum, as before, but the assignment of partition boundaries is different. A distance corresponding to one standard deviation of the cluster is added to the first and last peak positions within the cluster to form an initial set of partition boundaries. Overlapped boundaries are resolved by using the mean of the two boundary values as the new shared boundary. The blank sections between two partitions are not resolved, and the total spectral region examined is not defaulted to be -100 to 250 ppm. This method of partitioning requires a library spectrum to have peaks within one or more of the distinct partitions in order to be classified as a valid subset.

Reverse Search Scoring. The scoring procedure consists of three sections: (1) a match evaluation section for selecting the best matches; (2) a subset evaluation section that assigns a penalty to each extraneous peak in the library partition being compared to the corresponding partition of the unknown spectrum; (3) a section for assigning a penalty to each extraneous peak of the library spectrum lying outside the specified partition boundaries. These three scoring features, working in combination, assure the best selection of library spectra that are true subsets of the unknown spectrum.
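The boundary assignment described under Reverse Search Partitioning can be sketched as follows. This is an illustrative Python sketch, not the original implementation: it assumes the clusters have already been found (the Fisher algorithm is not shown), it reads "a distance is added" as widening each cluster outward by one standard deviation on both sides, and it assumes the population standard deviation, which the text does not specify. The name `reverse_partitions` is hypothetical.

```python
from statistics import pstdev

def reverse_partitions(clusters):
    """Return (low, high) partition boundaries in ppm, one per cluster.

    clusters: non-empty lists of chemical shifts, one list per cluster.
    Assumption: population standard deviation; a one-peak cluster gets 0.
    """
    ordered = sorted((sorted(c) for c in clusters), key=lambda c: c[0])
    bounds = []
    for peaks in ordered:
        spread = pstdev(peaks) if len(peaks) > 1 else 0.0
        # Widen the cluster by one standard deviation on each side.
        bounds.append([peaks[0] - spread, peaks[-1] + spread])
    # Overlapping neighbours share the mean of the two boundary values.
    for left, right in zip(bounds, bounds[1:]):
        if left[1] > right[0]:
            shared = (left[1] + right[0]) / 2.0
            left[1] = right[0] = shared
    # Blank regions between partitions are deliberately left unresolved.
    return [tuple(b) for b in bounds]
```

For example, the clusters `[10.0, 12.0]` and `[13.0, 15.0]` each have a standard deviation of 1.0 ppm, so their initial boundaries overlap between 12.0 and 13.0 and are resolved to a shared boundary at 12.5 ppm, while an isolated peak far downfield keeps its own partition with a gap before it.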
In the match evaluation section, exact matches and “good” matches are assigned credit points that are subtracted from a large constant score which is assigned initially. In this way, those spectra having the greatest number of “good” and exact matches acquire a lower score, indicating greater spectral similarity. It is possible for a given partition in the library spectrum to have a large number of valid matches based solely on the fact that it contains many more shifts than the unknown. However, the set of peaks within this library partition cannot be considered a true subset, and the extraneous peaks that have not been matched should acquire a penalty, the magnitude of which should also be great enough to offset the credit points given to all the “good” and exact matches. This is accomplished in the subset evaluation section of the scoring algorithm. The scores from the match and subset evaluation sections of each partition are summed to produce a tentative final score.

After all partitions in the unknown spectrum have been sequentially compared to the corresponding partitions within a given library spectrum, the last section of the scoring procedure checks for the presence of extraneous peaks lying external to the specified partition regions. These extraneous peaks acquire the most severe penalty because their presence clearly indicates that the given library spectrum is not a true subset of the target spectrum. This penalty is added to the sum of partition scores, producing a total score for the overall spectral comparison.

The best subset spectrum of a given target spectrum is itself. Therefore, if the unknown spectrum is contained in the library, that spectral comparison will have the lowest score and the unknown will be positively identified. As before, a continuously updated hit list is kept during the search, and smaller scores are given to spectra that are better subsets of the unknown spectrum.
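The three scoring sections can be illustrated with a simplified Python sketch. Everything numeric here is an assumption: the text gives no values for the initial constant, the credits, the penalties, or the tolerances that separate exact from “good” matches. The slide factor, line mapping, and matrix steps are also replaced by a plain nearest-peak rule, so this shows the shape of the scoring rather than the authors' procedure. The names `partition_index` and `reverse_score` are hypothetical.

```python
START_SCORE = 1000.0     # hypothetical large initial constant
EXACT_CREDIT = 10.0      # hypothetical credit for an exact match
GOOD_CREDIT = 5.0        # hypothetical credit for a "good" match
SUBSET_PENALTY = 20.0    # hypothetical; chosen to outweigh any single credit
OUTSIDE_PENALTY = 100.0  # hypothetical; the most severe penalty
EXACT_TOL = 0.1          # ppm, assumed tolerance for an exact match
GOOD_TOL = 1.0           # ppm, assumed tolerance for a "good" match

def partition_index(peak, partitions):
    """Index of the first partition containing peak, or None."""
    for i, (low, high) in enumerate(partitions):
        if low <= peak <= high:
            return i
    return None

def reverse_score(unknown, library, partitions):
    """Lower scores mean the library spectrum is a better subset."""
    score = START_SCORE
    unmatched = {i: [] for i in range(len(partitions))}
    for shift in unknown:
        i = partition_index(shift, partitions)
        if i is not None:
            unmatched[i].append(shift)
    for peak in sorted(library):
        i = partition_index(peak, partitions)
        if i is None:
            score += OUTSIDE_PENALTY      # section 3: outside all partitions
            continue
        targets = unmatched[i]
        nearest = min(targets, key=lambda u: abs(u - peak), default=None)
        if nearest is None or abs(nearest - peak) > GOOD_TOL:
            score += SUBSET_PENALTY       # section 2: extraneous in partition
            continue
        exact = abs(nearest - peak) <= EXACT_TOL
        score -= EXACT_CREDIT if exact else GOOD_CREDIT   # section 1
        targets.remove(nearest)           # each unknown peak is used once
    return score
```

Under these assumed constants, a target spectrum scored against itself receives the lowest score, a clean subset scores slightly higher, and a library spectrum with a peak outside every partition is pushed well down the hit list, which is the ordering the text describes.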
It is important to note that the score is largely based on the number of matching lines contained in each subset spectrum. Thus, if the target spectrum contains 10 shifts, a subset spectrum corresponding to seven matches (a seven-line subset) will result in a better score than a subset spectrum containing only four matches (a four-line subset). However, both subset spectra may be equally important in the evaluation of the target spectrum. Therefore, given an unknown spectrum consisting of n shifts, (n - 1) hit lists of the best 2- through n-line subsets are constructed in addition to the hit list of best overall matches. Clearly, the reverse search was not designed to give an absolute and definitive assessment of every unknown target spectrum. Rather, the reverse search was designed to provide the investigator with several possible leads in the analysis of a complex spectrum based on available spectral information contained in the library. Actual visual comparisons must be an important component of the analysis of results from a reverse search.

Evaluation of Reverse Search Results. A 1:1 mixture of 2-aminobenzoic acid and 4-aminobenzoic acid was prepared in order to test the ability of the reverse search to resolve the mixture spectrum into its component spectra. Upon examination of the reverse search hit lists, the spectrum of 2-aminobenzoic acid (seven lines) appeared as hit 1, and the spectrum of 4-aminobenzoic acid (five lines) appeared as hit 4 on the list of best overall matches. Each appeared as hit 1 on its respective seven- or five-line subset hit list. After visual examination of the results of the hit lists, it was evident that these two spectra were clearly the most accurate matches.

To further test reverse search capabilities, a more complex mixture spectrum of D-altrose was examined. In water, D-altrose mutarotates to form an equilibrium between four isomers: β-D-altropyranose, α-D-altropyranose, β-D-altrofuranose, and α-D-altrofuranose. Therefore, the spectrum of D-altrose consists of 24 lines representing an isomeric mixture.
An additional set of 67 spectra of various other monosaccharides was added to the set of 7198 library spectra in order to provide more information about the class of the target compound, thereby resulting in enhanced interpretive ability of the reverse search algorithm. The spectra of β-D-altropyranose and α-D-altropyranose, each having six lines, appeared as hits 6 and 7 on the hit list of overall best matches. However, they appeared as hits 1 and 2 on the hit list of best six-line subsets. The spectra of the altrofuranose forms were not contained in either the main library or the added set of 67 spectra. These results indicate the importance of performing a visual examination of the first few matches on each of the subset hit lists. The position of these two spectra on the overall hit list was purely a function of the number of lines within each spectrum. However, visual comparison revealed these two six-line spectra to be the best subsets in terms of exactness of each line match.

The reverse search can also provide information about substructures of complex molecules. As structural complexity of a compound increases, the likelihood that the library contains the spectrum of the compound decreases. However, the library may contain spectra of compounds that are substructural units of the more complex target compound. The spectrum of amygdalin, a substituted disaccharide ([(6-O-β-D-glucopyranosyl-β-D-glucopyranosyl)oxy]benzeneacetonitrile), was used to test the ability of the reverse search to identify substructures. An additional set of 46 spectra of various disaccharides was added to the library of 7198 spectra in order to provide the reverse search with more information about the appropriate compound class. Since the spectrum of amygdalin (18 lines) was contained in the library, the target spectrum could be positively identified. The next two matches on the list of overall best matches were β-gentiobiose (6-O-β-D-glucopyranosyl-β-D-glucose (11 lines)) and α-gentiobiose

Figure 7. 13C NMR broad-band decoupled spectra of amygdalin and corresponding gentiobiose substructures identified by the reverse search.

(6-O-β-D-glucopyranosyl-α-D-glucose (11 lines)). These two matches also appeared as the best matches on the list of best 11-line subset spectra. Upon inspection of the spectra and comparison of structures, it is found that the gentiobiose structure is the main structural backbone of amygdalin, and this is clearly indicated by the reverse search results. The spectra and structures of these compounds are presented in Figure 7.
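The per-line-count hit lists used throughout this evaluation can be sketched in Python as follows. This is only an illustration: the tuple layout, the `top` cutoff, and the function name `build_hit_lists` are assumptions, and the scores in the usage note are invented purely to show the ranking behaviour.

```python
def build_hit_lists(results, n_unknown, top=5):
    """Build the overall hit list plus one list per subset size.

    results: iterable of (name, matched_lines, score); lower score is better.
    Returns (overall, by_lines), where by_lines maps each subset size
    k = 2..n_unknown to its best hits, giving n_unknown - 1 lists in total.
    """
    ranked = sorted(results, key=lambda r: r[2])   # smallest score first
    by_lines = {k: [] for k in range(2, n_unknown + 1)}
    for hit in ranked:
        lines = hit[1]
        if lines in by_lines and len(by_lines[lines]) < top:
            by_lines[lines].append(hit)
    return ranked[:top], by_lines
```

With hypothetical results such as `[("A", 7, 100.0), ("B", 6, 120.0), ("C", 6, 130.0), ("D", 5, 150.0)]` for a 12-line unknown, spectrum D is only fourth on the overall list because it has fewer lines, yet it heads the five-line subset list. This mirrors how a short component spectrum can rank low overall and still be the best subset of its size.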

CONCLUSIONS The goal of any library search is to identify an unknown spectrum, but when this is not possible, it is desirable to obtain a hit list of compounds that are most structurally similar to the target compound. The success of a search clearly depends on the size, quality, and diversity of the library used. For a given library, it is difficult to determine what the best hit list should be for a specific target compound. In this regard, Laude and Wilkins have used other spectral evidence to assess the correctness of hit lists (13), while Delaney (14) has introduced a method for comparing hit lists from a search to the results of a "standard" search. Neither method, however, finds the compounds in a library that are most structurally similar to the target compound. Based on visual results, therefore, as well as upon knowledge about the various compound classes represented in the library used here, the search is judged to be highly successful in identifying the structural characteristics of a given target compound. The reverse search


provides similar capabilities for the case of impure or mixture spectra. While we have no statistical basis for judging search performance, the search results presented here are typical of those obtained from a large number of test cases.

The execution speed of a library search depends on the efficiency of the search algorithm itself, the efficiency with which the algorithm is programmed, and the power of the computer on which the search is implemented. We have chosen to evaluate the execution speed of the IBM PC-AT version of the search, as this computer is widely available. The execution time of the search is on the order of n², where n is the number of lines in the unknown spectrum. The time required for a search includes approximately 2.5 min of overhead attributed to reading the library and updating the hit list, plus roughly 19 s per line, plus 0.5n² s, a term which accounts for increasing search complexity. Thus, the search of a 10-line unknown spectrum requires 6.5 min to complete.

We believe the search could be further improved by the introduction of a procedure to convert search scores to some absolute scale indicative of the degree of spectral similarity or dissimilarity, thereby facilitating easier score interpretation. Further, the incorporation of additional spectral information (e.g., line multiplicity) would greatly increase the power of the search. The latter enhancement, however, must await the availability of libraries in which chemical shifts are augmented with additional spectral information.
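The timing estimate quoted above for the IBM PC-AT version can be written out as a short Python function. The coefficients are the approximate values given in the text; the function name `search_time_seconds` is hypothetical, and the result should be read as a rough estimate for that machine and library only.

```python
def search_time_seconds(n_lines):
    """Approximate search time in seconds for an n-line unknown spectrum."""
    overhead = 2.5 * 60               # reading the library, updating the hit list
    per_line = 19 * n_lines           # roughly 19 s per line
    quadratic = 0.5 * n_lines ** 2    # term for increasing search complexity
    return overhead + per_line + quadratic

# A 10-line unknown: 150 + 190 + 50 = 390 s, i.e. 6.5 min as stated in the text.
minutes_for_ten_lines = search_time_seconds(10) / 60
```

For small n the fixed overhead and the linear term dominate; the quadratic term only becomes the largest contribution for spectra with several dozen lines.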

LITERATURE CITED
(1) Zupan, J.; Hadzi, D.; Marsel, J.; Penca, M. Anal. Chem. 1977, 49, 2141-2146.
(2) Uthman, A. P.; Koontz, J. P.; Hinderliter-Smith, J.; Woodward, W. S.; Reilley, C. N. Anal. Chem. 1982, 54, 1772-1777.
(3) Schwarzenbach, R.; Meili, J.; Konitzer, H.; Clerc, J. T. Org. Magn. Reson. 1976, 8, 11-16.
(4) Mlynarik, V.; Vida, M.; Kello, V. Anal. Chim. Acta 1980, 122, 47-56.
(5) Zippel, M.; Mowitz, J.; Kohler, I.; Opferkuch, H. Anal. Chim. Acta 1982, 140, 123-142.
(6) Stuper, A. J.; Jurs, P. C. J. Chem. Inf. Comput. Sci. 1976, 16, 99-105.
(7) Hartigan, J. A. Clustering Algorithms; Wiley: New York, 1985.
(8) Tou, J. T.; Gonzalez, R. C. Pattern Recognition Principles; Addison-Wesley: Reading, MA, 1974.
(9) Massart, D.; Kaufman, L. The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis; Wiley: New York, 1983.
(10) McLafferty, F. W.; Stauffer, D. B. J. Chem. Inf. Comput. Sci. 1985, 25, 245-252.
(11) Laude, D. A., Jr.; Wilkins, C. L. Anal. Chem. 1986, 58, 2820-2824.
(12) Laude, D. A., Jr.; Wilkins, C. L. Anal. Chem. 1987, 59, 576-581.
(13) Laude, D. A., Jr.; Cooper, J. R.; Wilkins, C. L. Anal. Chem. 1986, 58, 1213-1217.
(14) Delaney, M. F.; Warren, F. V., Jr.; Hallowell, J. R., Jr. Anal. Chem. 1983, 55, 1925-1929.

RECEIVED for review March 15, 1988. Accepted May 9, 1988. Portions of this work were presented at the 1987 Pittsburgh Conference and Exposition on Analytical Chemistry and Applied Spectroscopy, Atlantic City, NJ, March 1987.