On Pattern Matching of X-Ray Powder Diffraction Data

Igor Ivanisevic,* David E. Bugay, and Simon Bates

SSCI Inc., 3065 Kent Ave., West Lafayette, Indiana 47906

Received: September 28, 2004; In Final Form: November 30, 2004

* Corresponding author. E-mail: [email protected].

We introduce a novel pattern matching algorithm optimized for X-ray powder diffraction (XRPD) data and useful for data from other types of analytical techniques (e.g., Raman, IR). The algorithm is based on hierarchical clustering with a similarity metric that compares peak positions using the full peak profile. It includes heuristics developed from years of experience manually matching XRPD data, and preprocessing algorithms that reduce the effects of common problems associated with XRPD (e.g., preferred orientation and poor particle statistics). This algorithm can find immediate application in automated polymorph screening and salt selection, common tasks in the development of pharmaceuticals.

Introduction

During the development of a drug candidate, a solid form of the drug substance that provides adequate efficacy, safety, and stability must be selected.1-3 The selection process frequently involves generation and characterization of a number of different materials to determine which have appropriate properties. For example, if the drug substance is ionizable, numerous salts may be prepared and evaluated. Another option is to carry out a search for useful co-crystals of the drug substance. In addition, whether the drug substance is alone or is present as a co-crystal or salt, the material must be screened for polymorphism. During polymorph screening it is common to carry out multiple crystallizations under various experimental conditions in an effort to identify as many different forms (polymorphs, hydrates, solvates, etc.) of the material as possible. Once the different samples are generated, X-ray powder diffraction (XRPD) is frequently used as an analysis tool to identify new forms. These screening processes are often automated, resulting in hundreds of XRPD patterns that need to be examined, sorted, and analyzed.

Pattern matching algorithms have been developed to automate the task of identifying new forms and mixtures among the patterns generated by a screen. The field has received considerable attention in recent years as more and more companies move to automated screening methods.1-3 Most prior work in this area has focused on classic search-and-match algorithms,4 where a database of known compounds (with all their known forms) is built over time and new patterns are compared to the known set in an effort to identify the best match. The comparison algorithms employed in the past include wavelets,5 principal component analysis, and simple (stick) peak matching. More recently, approaches based on full-profile pattern matching and multivariate statistics, which can also perform quantitative analysis of mixtures, have been developed.6 The best algorithms are still inferior to manual human matching, primarily because of common problems with lab-generated XRPD data that are difficult to address in fully automated algorithms (see the Algorithm section below).

Our approach was to recognize that, for any given polymorph screen of a drug, most of the samples (and hence XRPD patterns) will be of the known polymorph(s), with typically a much smaller percentage containing new forms and/or inferior-quality data.

Furthermore, since a screen will likely be performed on the same instrument(s) and under similar conditions (operator experience, calibration, etc.), data will typically correlate well with other samples that are part of the same screen. On the other hand, the data may well appear quite different from the known reference patterns, especially if those were generated on different instruments and prepared under different conditions, making classic search-and-match algorithms ineffective. Calculated patterns can also be of little help if preferred orientation or poor particle statistics affect the data.

Preferred orientation (PO) and poor particle statistics both represent a departure from an ideal, randomly oriented powder sample. PO is the self-alignment of crystal faces in the powder sample. Crystals can have extreme morphologies, such as plates or needles, that cause them to align in preferred directions (like sheets of paper dropped on a table). Changes in PO (the degree of self-alignment) among samples will cause significant changes in measured X-ray diffraction peak intensities. Because PO is driven by crystal face morphology, families of diffraction peaks will show correlated intensity changes. When samples have similar crystal morphologies, PO can be reproduced from one sample to another, but the measured powder pattern will not be consistent with the theoretical powder pattern.

Poor particle statistics occur when there are not enough crystals in the illuminated sample volume to ensure that all possible crystal orientations are represented. Whenever small amounts of material are used or the sample is predominantly made up of large crystals, poor particle statistics will result. Being a product of incomplete statistics, the effect is random and causes unpredictable changes in peak intensity from one sample to another. When poor particle statistics are driven by large crystals, they are often accompanied by apparent peak splitting and peak shifting in the XRPD pattern.

The goal of our algorithm is to group the vast majority of the uninteresting data (known forms) while ensuring there are few or no false positives: all the new forms, as well as questionable data, should be grouped separately and presented to the operator for verification. This approach lends itself to a well-known technique called hierarchical cluster analysis (HCA)7,8 that has been applied to many different types of matching problems in the past but has only recently seen common use in XRPD pattern matching.7 It also stands in contrast to the usual search-and-match approaches, as it does not require any reference patterns.


To handle the variety of XRPD patterns generated by different instruments, we designed a robust similarity metric based on the comparison of peak positions, with heuristics that improve the quality of the match for some commonly encountered data artifacts. The algorithm is further assisted by a user interface that allows quick and easy verification of results and manual corrections.

Algorithm

Preprocessing. Lab-generated XRPD data suffer from many instrument-, sample-, and/or operator-induced problems. Common ones include PO, poor particle statistics, noise, enhanced or irregular baselines, peak displacement, poor calibration, constant theta shifts due to sample preparation, irregular peak widths, etc. Some instrumental effects have obvious manifestations in the data (e.g., noise, baseline, or a constant θ shift) and can be corrected with automatic algorithms. Others, such as PO and poor particle statistics, are difficult to spot even for experienced scientists and are frequently the cause of misclassification in automatic algorithms. Advances in lab instrumentation are unlikely to correct these problems soon, given the time constraints and small sample sizes in a typical polymorph screen. Therefore, we have to deal with imperfect data but also apply preprocessing techniques to alleviate some of these problems. The preprocessing in our algorithm includes the following stages:

Normalization. Intensities in each pattern are scaled to a range of [0, 1].

Truncation. Patterns are reduced along 2θ to a user-specified region of interest, typically 2.5-40°. For organic molecules, especially larger molecules, data beyond 40° 2θ (when available) suffer from the rapid reduction of the electronic form factor and uncertainties in the long-range packing of the molecule.

Smoothing. Savitzky-Golay9 or digital filter10 smoothing is used to reduce the level of noise in the data. The algorithm automatically adjusts the smoothing parameters depending on the level of noise detected in the pattern, as determined by the variance method. If the user opts to match amorphous and crystalline components of the pattern, different smoothing parameters are used for the two (generally, heavier smoothing is applied in the amorphous case).

Baseline Correction. A variable window/pass digital filter is used to remove the baseline of each pattern. The window adjusts automatically based on the position in the pattern, allowing the algorithm to remove irregular instrumental baselines that tend to be more pronounced at the beginning of the pattern (e.g., below 5° 2θ).
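
A minimal sketch of these preprocessing stages, for illustration only: the Savitzky-Golay window, the polynomial order, the rolling-minimum baseline removal, and the stage ordering are assumptions standing in for the adaptive filters described above, not the published implementation.

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(two_theta, intensity, lo=2.5, hi=40.0, window=11, polyorder=3):
    """Truncate, smooth, baseline-correct, and normalize one XRPD pattern."""
    two_theta = np.asarray(two_theta, dtype=float)
    y = np.asarray(intensity, dtype=float)

    # Truncation: keep the user-specified two-theta region of interest.
    mask = (two_theta >= lo) & (two_theta <= hi)
    two_theta, y = two_theta[mask], y[mask]

    # Smoothing: Savitzky-Golay filter; heavier smoothing would be used
    # for noisy (e.g., amorphous) data.
    y = savgol_filter(y, window_length=window, polyorder=polyorder)

    # Baseline correction: crude rolling-minimum stand-in for the
    # variable window/pass digital filter described in the text.
    half = 25
    baseline = np.array([y[max(0, i - half):i + half + 1].min()
                         for i in range(len(y))])
    y = y - baseline

    # Normalization: scale intensities to [0, 1].
    y = (y - y.min()) / (y.max() - y.min() + 1e-12)
    return two_theta, y
```
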
Hierarchical Cluster Analysis. Starting with a set of N items (XRPD patterns in our case) and an N × N similarity matrix describing the relative similarity of each item to each other item, the basic process of HCA7 is as follows:

1. Initially assign each item to its own cluster, producing N clusters, each containing one item. Let the similarities between the clusters equal the similarities between the items they contain.

2. Find the most similar pair of clusters and merge them into a single cluster, resulting in one less cluster (for an initial total of N - 1 clusters).

3. Compute similarities between the new cluster and each of the remaining old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Each merge operation can be considered a branch in a tree of clusters. This tree is called a dendrogram and has its root in the final cluster that contains all N items. The leaves of the tree are the initial N single-item clusters.
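
A bare-bones illustration of the four steps above, written against a precomputed similarity matrix; this naive loop is for exposition only and is not the published implementation.

```python
import numpy as np

def hca(sim, linkage="complete"):
    """Agglomerative clustering over an N x N similarity matrix.

    Returns the merge history as (members_a, members_b, similarity) tuples,
    i.e., the branches of the dendrogram from leaves to root.
    """
    n = sim.shape[0]
    clusters = [[i] for i in range(n)]            # step 1: one item per cluster
    merges = []
    while len(clusters) > 1:
        best, best_pair = -np.inf, None
        # Step 2: find the most similar pair of clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair = [sim[i, j] for i in clusters[a] for j in clusters[b]]
                # Step 3: cluster distance metric -- complete link uses the
                # smallest member-to-member similarity, average link the mean.
                s = min(pair) if linkage == "complete" else sum(pair) / len(pair)
                if s > best:
                    best, best_pair = s, (a, b)
        a, b = best_pair
        merges.append((tuple(clusters[a]), tuple(clusters[b]), best))
        clusters[a] += clusters[b]                # step 4: merge and repeat
        del clusters[b]
    return merges
```

Cutting the resulting dendrogram at a chosen similarity level (or, with the metric described below, at a chosen number of allowed missing peaks) then yields the reported groupings.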

Step 3 can be performed in different ways, resulting in different cluster distance metrics. Among the most commonly used are single link, complete link, and average link. These three metrics use the greatest, smallest, and average similarity, respectively, between any item in one cluster and any item in the other cluster as the similarity between the two clusters. Our experience has shown the complete and average links to be useful cluster distance metrics for XRPD data, with the former producing the strictest sort (the least chance of different forms being grouped in the same cluster) but typically producing more clusters than there are actual distinct forms. Additional background information, other widely used cluster distance metrics, and a more efficient algorithm for certain types of cluster distance metrics can be found in the literature.11

Similarity Metric. The core requirement of HCA is the derivation of a measure of similarity between the objects being clustered. The success or failure of the HCA approach is entirely dependent on the robustness of the measure of similarity chosen. In our algorithm, the measure of similarity is a comparison of peak positions, modified by heuristics. The peak detection stage produces a list of peak top positions as well as the boundaries of the peak range, the centroid, and a probability factor (PF) that the peak is truly present. On the basis of the PF, the peaks are allocated to four discrete groups: group 1 comprises the "important" peaks (PF = 100%), group 2 the "regular" peaks (50 < PF < 100%), group 3 the major trace peaks (25 < PF < 51%), and group 4 the minor trace peaks (0 < PF < 26%).

When the distance between any two patterns is computed, the group 1 and 2 peaks from both patterns are compared. A group 1 peak in one pattern has a match in the other pattern if there is a group 1, 2, or 3 peak in the other pattern close enough in position. Closeness is measured by the distance between the peak apexes or the peak centroids, whichever is smaller, and the allowed distance is a user-adjustable parameter. A group 2 peak can be matched to any group 1, 2, 3, or 4 peak in the other pattern. Each mismatched peak results in a penalty applied to the similarity between the two patterns, adjusted by heuristics. The default penalty for group 1 peaks is twice the penalty for group 2 peaks. The following heuristics can relax a penalty for a missing peak if the user allowed their inclusion prior to running the algorithm. Note that many of the heuristics use the full peak profile information (as measured in the pattern, not fitted) and could not be used if the peak detection algorithm did not reliably return the entire range of the peak.
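
As an illustration of this core scoring step (before any heuristics relax penalties), the sketch below assumes each detected peak is reduced to a position and a group number; the tolerance and the 2:1 penalty weighting are stand-ins for the user-adjustable closeness parameter and the default group 1/group 2 penalty ratio, and apex/centroid handling is omitted for brevity.

```python
def similarity_penalty(peaks_a, peaks_b, tol=0.2, penalty_g1=2.0, penalty_g2=1.0):
    """Sum mismatch penalties between two peak lists.

    Each peak is a (position_in_two_theta, group) tuple.  Group 1 peaks may
    match group 1-3 peaks in the other pattern; group 2 peaks may match any
    group 1-4 peak.  Lower totals mean more similar patterns.
    """
    def unmatched(src, other):
        total = 0.0
        for pos, grp in src:
            if grp not in (1, 2):
                continue                      # only group 1 and 2 peaks are scored
            allowed = (1, 2, 3) if grp == 1 else (1, 2, 3, 4)
            matched = any(g in allowed and abs(pos - p) <= tol for p, g in other)
            if not matched:
                total += penalty_g1 if grp == 1 else penalty_g2
        return total

    # Penalties are accumulated in both directions; the heuristics described
    # next (area overlap, twin/sister peaks, etc.) would reduce individual
    # penalties before this sum is taken.
    return unmatched(peaks_a, peaks_b) + unmatched(peaks_b, peaks_a)
```
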
Pattern Area Overlap. This heuristic may apply if the area of the missing peak is almost entirely overlapped by the corresponding area in the other pattern. Such situations can occur, for example, when a trace peak exists at this location in the other pattern but just missed the detection threshold (e.g., due to its shape or height). If detected, this heuristic completely removes the matching penalty for that peak, as it is really correcting a failure of the peak detection algorithm. In theory, we could lower the thresholds in peak detection to identify such peaks and eliminate this heuristic. In practice, doing so produces too many false peaks, which have a very negative effect on the accuracy of the algorithm.

Partial Peak Overlap. This heuristic may apply if there is an unmatched peak in the other pattern that is reasonably close to the missing peak and their areas partially overlap. Peak displacement due to instrumental effects is a common phenomenon in laboratory XRPD data that can cause such situations. An additional constraint is that this heuristic will only apply if there is a small total number of mismatched peaks between the two patterns.

Figure 1. Crystalline peak detection algorithm example. The group assignments are shown for some of the detected peaks.

If partial peak overlap is detected, this heuristic will reduce the penalty but not eliminate it, as we cannot be certain that this is a case of peak displacement rather than a true mismatch.

Twin and Sister Peaks. XRPD data of pharmaceuticals frequently contain multiple overlapping peaks. These peaks often manifest themselves as bumps on the side of a larger peak (sister peaks) or as split peak tops (twin peaks). The magnitude of the bump or split top can vary greatly from pattern to pattern, which means the peak detection algorithm will not always identify such peaks as separate entities. If a missing peak is classified by the algorithm as a sister or twin peak, this heuristic examines the other pattern for the presence of a large peak in the same theta range. If one is found, the heuristic further looks for a break in the slope on the corresponding side of the large peak and within a specified distance of the sister/twin peak. The presence of a break in the slope eliminates any penalty associated with the missing peak, while the simple presence of a large peak nearby reduces any group 1 penalty to a group 2 penalty.

Total Peak Overlap. The previous heuristic handles many of the peak overlap conditions that occur in XRPD data but is limited to sister/twin peaks and does not apply to free-standing peaks. However, it is possible that a corresponding peak in the other pattern is completely enclosed by a large peak, to the point where no visible trace of it remains other than a base that is larger than normal for a peak of that size. This heuristic applies to group 2 peaks that share the same theta range as a larger peak in the other pattern and partially overlap in area. If total peak overlap is detected, this heuristic eliminates the penalty associated with such peaks.

After the heuristics have been applied to all unmatched peaks, the penalties are summed to give the total similarity score for the two patterns. By default, up to three group 2 peaks can be missing for two patterns to be part of the same cluster, assuming the strictest clustering metric. A large arbitrary penalty is assigned to matches between patterns containing amorphous peaks and pure crystalline patterns (and likewise for amorphous-crystalline mixtures with respect to both pure categories).

Finally, if the user enables the option, the algorithm automatically tries to detect a near-constant theta shift between the peaks of two patterns. The size of the shift and the percentage of peaks exhibiting it are parameters. If such a shift is detected, the algorithm will shift one of the patterns by the same amount and re-run the similarity metric.

If a better score is produced, it is kept in place of the unshifted score.

Peak Detection. Two different peak detection algorithms are employed, one for crystalline and another for amorphous peaks (the latter is only run if the user specified the amorphous matching option).

Crystalline Peak Detection. The crystalline peak detection algorithm is a sweep for local maxima that exceed a threshold among neighboring points. The threshold is set automatically and adjusted by the noise level in the pattern. As soon as a local maximum satisfying the threshold criteria is found, the algorithm attempts to identify the beginning and end of that peak. It does so by descending the edges on each side of the apex, looking for a typical roof peak shape or a plateau peak shape. The stopping conditions in the descent include a change of slope over a number of consecutive points and approaching sufficiently close to the baseline. Other shape-specific stopping conditions include failure to descend a sufficient amount over a minimum range (for roof-shaped peaks) or failure to drop significantly near the ends (for plateau-shaped peaks). Once the boundaries of the peak have been determined by this procedure, the probability that this is a real peak is computed (see below). If both shapes (plateau and roof) produced viable boundaries for the peak, only the higher-probability one is kept. The algorithm then picks up from the end of the current peak range and continues the sweep for local maxima. This algorithm is likely to miss tiny peaks sitting on the slope of a larger one, but that is intentional, as we have heuristics to deal with that situation.

Each peak detected by the above procedure is assigned a probability based on a sum of factors. The contributing factors can be positive or negative and include the peak height, the width at half-maximum, and the peak's placement in its "neighborhood". Finally, corrections are made for extreme cases where this formula may fail (e.g., small peaks are restricted in how high a probability they can be assigned, regardless of other considerations). Peaks with a probability of zero or less are discarded as false peaks. The others are kept and will be grouped by the pattern matching algorithm into the aforementioned four discrete groups. Figure 1 shows the crystalline peak detection algorithm at work.
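
A much-simplified sketch of this sweep, assuming a preprocessed intensity array; the noise-derived threshold, the flank-descent stopping rules, and the returned fields are coarse assumptions standing in for the roof/plateau shape tests and the probability factors described above.

```python
import numpy as np

def detect_peaks(y, k=3.0):
    """Sweep for local maxima above a noise-derived threshold and walk down
    both flanks of each candidate to estimate the peak boundaries."""
    y = np.asarray(y, dtype=float)
    noise = np.std(np.diff(y)) / np.sqrt(2.0)     # crude noise estimate
    threshold = k * noise
    peaks, i = [], 1
    while i < len(y) - 1:
        if y[i] > y[i - 1] and y[i] >= y[i + 1] and y[i] > threshold:
            # Descend each flank until the slope turns back up
            # or the signal drops near the baseline.
            left = i
            while left > 0 and y[left - 1] <= y[left] and y[left] > 0.5 * noise:
                left -= 1
            right = i
            while right < len(y) - 1 and y[right + 1] <= y[right] and y[right] > 0.5 * noise:
                right += 1
            peaks.append({"apex": i, "start": left, "end": right,
                          "height": y[i], "width": right - left})
            i = right + 1          # resume the sweep after the current peak range
        else:
            i += 1
    return peaks
```
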

Amorphous Peak Detection. This algorithm starts with somewhat different preprocessing settings in the stages described earlier. Generally, the baseline correction is greatly reduced to avoid eliminating the amorphous halo, and the noise smoothing is greatly increased (as pure amorphous patterns are typically very noisy). The algorithm itself consists of sweeping along theta looking for regions of a minimum width in which all points are at least one threshold above the sweep line in intensity. Allowances are made for the edges of such regions, but the middle must be continuous. In addition, even if no amorphous peaks were detected, a pattern is declared amorphous in the absence of any major crystalline peaks.

Postprocessing and User Interface. At the end of the algorithm, the user is presented with the groupings (clusters) produced under the default similarity metric parameters. The user interface can also display a dendrogram, allowing the user to take different "cuts" of the cluster tree and produce different groupings. Individual patterns and whole clusters can also be manually reclassified regardless of their algorithm assignment. Upon algorithm completion, the user also has the option to run a mixture identification algorithm that examines a representative pattern from each cluster and compares it to combinations of representative patterns from other clusters, in the hope of identifying mixtures of clusters. The mixture algorithm simply looks for the presence of all peaks from multiple patterns in a single pattern. Other components of the user interface allow quick comparison between user-selected reference patterns of various clusters and multicomponent mixture visualization, as well as the usual data visualization techniques (shifting, zooming, etc.). The mixture visualization tool allows the user to create a composite of multiple cluster reference patterns and compare it to the reference patterns of potential mixture clusters.

It should be noted that the proposed similarity metric lends itself to more natural user interaction when it comes to cutting the dendrogram, since the user simply needs to specify the number of missing peaks allowed by the algorithm to control the placement of the cut. This stands in contrast to more traditional algorithms, which typically produce abstract similarity scores on a scale from 0 to 1 that are relative to each data set and hard to interpret before running the algorithm.

The algorithm can currently handle 1000 patterns in 15 min on a 1.2 GHz Pentium 4 with 512 MB of RAM, with the heuristics and the auto-shifting option enabled (these more than double the run time, on average). The current version retains all pattern information in memory, making it unsuitable for problems involving many thousands of patterns unless specialized computers are used or a parallel HCA implementation is developed, similar to the one described in the literature.11 The current implementation with all options turned on appears to run in time proportional to the square of the number of input patterns, O(n^2), but it has not been optimized.

Results and Discussion

Overview. This section presents results from our evaluation of the algorithm's performance. The algorithm was run on selected data sets, and its output was compared to the manually assigned classification. In addition, we ran an implementation of the traditional sum of squared and weighted differences algorithm,12,13 Rwp, on the same data sets for comparison purposes.
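
For reference, a minimal form of the weighted-profile agreement factor we take Rwp to denote (refs 12, 13), computed point by point between two patterns on a common 2θ grid; the uniform default weights are an assumption, not necessarily those of the implementation tested here.

```python
import numpy as np

def rwp(y_obs, y_ref, weights=None):
    """Weighted-profile R factor between two patterns on a common grid.

    Smaller values mean more similar patterns; 1 - Rwp can serve as the
    similarity fed to the clustering step.  Uniform weights by default.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_ref = np.asarray(y_ref, dtype=float)
    w = np.ones_like(y_obs) if weights is None else np.asarray(weights, dtype=float)
    return np.sqrt(np.sum(w * (y_obs - y_ref) ** 2) / np.sum(w * y_obs ** 2))
```
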
The Rwp implementation included preprocessing similar to that used by the new algorithm in an effort to minimize instrumental effects. Default algorithm settings were used in all cases, except as noted below. No user corrections were made to the algorithm results; i.e., the default (automatic) dendrogram cut was used when evaluating results.

As mentioned earlier, for the new algorithm this corresponds to patterns matching if the equivalent of up to three group 2 peaks differ between them. For the Rwp algorithm, the default dendrogram cut was a similarity of 0.9 on a scale of 0-1, with 1 being "most similar". The default dendrogram cuts were observed to produce roughly the same number of clusters for both algorithms and all data sets.

We chose six data sets to test the algorithm. The criteria for selecting the data sets included size, the availability of a human-assigned classification, plurality of forms and/or mixtures, and the presence of common XRPD artifacts (e.g., noise, PO/poor particle statistics, theta shift). The goal was to test data sets that were a reasonable representation of typical polymorph screen data. The data sets are described below in more detail.

Set 1. This set contains 175 patterns and is the largest of the six. A total of seven forms and one mixture are present, with the majority of the patterns (84%) being form 1. The other six forms are roughly uniformly distributed in the set. Most of the data were collected on an INEL PSD diffractometer, but some Bruker GADDS and Shimadzu Bragg-Brentano data are also present. This set is a good example of typical polymorph screen data, with all of the different XRPD artifacts represented. Amorphous matching was turned off for this data set, as it contains no amorphous data but some patterns have odd baselines that could be detected by the algorithm as amorphous.

Set 2. This set contains 50 patterns, all collected on the Bruker system. There are eight forms and two mixtures, including some amorphous data. The various forms and mixtures are roughly evenly distributed, and three of the forms are very similar. This set tests the algorithm's ability to handle wide peaks as well as very pronounced baselines. Representative patterns for five of the eight forms in this data set are shown in Figure 2.

Set 3. This set contains 21 patterns, mostly collected on the Shimadzu system with some INEL data as well. There are four crystalline forms, one crystalline mixture, one amorphous form, and one amorphous-crystalline mixture. Amorphous matching and performance on near-ideal data are the main purposes of this test.

Set 4. This set contains 77 patterns, mostly Shimadzu with some INEL data. There are eleven forms, one mixture, and some amorphous data. The two main forms cover 50% of the data. This set contains examples of extreme noise, sharp peaks, broad (disordered) peaks, heavy PO/poor particle statistics (entire families of peaks are missing), and peak displacement. Most of the patterns of the primary form suffer from heavy PO/poor particle statistics.

Set 5. This set contains 68 patterns, again mostly Shimadzu with some INEL data. There are four forms, three mixtures, and some amorphous data. This set is similar in nature to set 4, except that it has fewer patterns with heavy PO/poor particle statistics and contains more mixtures. Sets 4 and 5 are meant to test the algorithm's ability to handle a variety of different data within the same problem set.

Set 6. This set contains 23 Shimadzu and INEL patterns. One form dominates, with one or two patterns for each of the other three forms (no mixtures). The two main forms are very similar, with just three to four peaks different between them. Some of the data are also theta-shifted by roughly 0.15°, so the set is a good test of the auto-shift feature.

Evaluation Criteria.
We wanted to evaluate the new algorithm using the two common cluster distance metrics (complete and average link; see the Algorithm section). In addition, we wanted to evaluate the impact of the heuristics on the quality of the match.

Figure 2. Five forms from test data set 2. Note that some of the patterns have been y-shifted for clarity of comparison; they all share a similar baseline.

TABLE 1: New Algorithm, Complete-Link, Using Heuristics

data set     M     N    % match    % match norm.    % work
1            2    18    98.9       92.9             88.6
2            2     1    96.0       94.9             94.0
3            1     1    95.2       93.8             90.5
4            0    14    100        100              81.8
5            0     7    100        100              89.7
6            0     1    100        100              95.7

TABLE 2: New Algorithm, Average-Link, Using Heuristics

data set     M     N    % match    % match norm.    % work
1            3     7    98.3       89.3             94.3
2            2     0    96.0       94.9             96.0
3            1     1    95.2       93.8             90.5
4            0    11    100        100              85.7
5            0     6    100        100              91.2
6            0     0    100        100              100

TABLE 3: New Algorithm, Complete-Link, No Heuristics

data set     M     N    % match    % match norm.    % work
1            0    33    100        100              81.1
2            1     3    98.0       97.4             92.0
3            0     2    100        100              90.5
4            1    27    98.7       98.2             63.6
5            1    15    98.5       97.3             76.5
6            0     3    100        100              87.0

TABLE 4: New Algorithm, Average-Link, No Heuristics

data set     M     N    % match    % match norm.    % work
1            1    20    99.4       96.4             88.0
2            1     2    98.0       97.4             94.0
3            0     2    100        100              90.5
4            0    24    100        100              68.8
5            1    10    98.5       97.3             83.8
6            0     2    100        100              91.3

TABLE 5: Rwp Algorithm, Complete-Link

data set     M     N    % match    % match norm.    % work
1            7    30    96.0       75.0             78.9
2            8     9    84.0       79.5             66.0
3            1     3    95.2       93.8             81.0
4            7    19    90.9       87.3             66.2
5            2    23    97.1       94.6             63.2
6            1     2    95.7       80.0             87.0

TABLE 6: Rwp Algorithm, Average-Link

data set     M     N    % match    % match norm.    % work
1            7    21    96.0       75.0             84.0
2           17     5    66.0       56.4             56.0
3            1     2    95.2       93.8             85.7
4           16    15    79.2       70.9             59.7
5            2    18    97.1       94.6             70.6
6            1     0    95.7       80.0             95.7

Thus, four test combinations are possible for each data set, and these are shown in Tables 1-4. In addition, the Rwp algorithm was run with the complete- and average-link metrics for comparison purposes, and its results are summarized in Tables 5 and 6.

Two different evaluation metrics were used to judge the effectiveness of the algorithms. The first metric (M) counts the number of patterns that were incorrectly matched with patterns of a different form/mixture. The second metric (N) counts the number of extra clusters produced by the algorithm. An extra cluster is a cluster that contains patterns of a form/mixture that is already present in an earlier cluster. This metric was deemed necessary to evaluate the algorithm's bias toward producing too many clusters (thus forcing the user to do extra work in finishing the job), and it is used in a measure of the overall "work" performed by the algorithm.

Three percentage values are reported for the evaluation metrics. % match is computed as 100 × (T - M)/T, where T is the total number of patterns in the data set; this is the traditional accuracy measurement. % match norm. is computed as 100 × (T - X - M)/(T - X), where X is the number of patterns of the most numerous polymorph (or mixture) in the data set; this accuracy measurement removes the bias in data sets where one polymorph dominates all others (a fairly typical situation in polymorph screening). For example, 84% of the patterns in data set 1 are form 1, and a simple algorithm that classified all patterns into one cluster would therefore score 84% on the % match scale without doing any actual sorting; the same algorithm would score 0% on the % match norm. scale. % work is computed as 100 × (T - M - N)/T. As the name implies, this is a measure of the overall work performed by the algorithm and of how much is left for the user to do. This criterion is necessary to properly evaluate algorithms that rely on the user for some of the work and are often biased toward producing more clusters than there are polymorphs/mixtures. In principle, one can create a simple algorithm that sorts each pattern into a separate cluster, scoring 100% on both % match metrics but not doing any actual work; such an algorithm would score very low on the % work criterion.
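
These three definitions reduce to simple arithmetic; a small helper (with hypothetical variable names mirroring the definitions above) and a spot check against the Table 1 entry for data set 1:

```python
def evaluation_scores(T, M, N, X):
    """T: total patterns, M: mismatched patterns,
    N: extra clusters, X: patterns of the most numerous form/mixture."""
    pct_match = 100.0 * (T - M) / T
    pct_match_norm = 100.0 * (T - X - M) / (T - X)
    pct_work = 100.0 * (T - M - N) / T
    return pct_match, pct_match_norm, pct_work

# Data set 1 under Table 1: T = 175, M = 2, N = 18, and X = 147
# (84% of the 175 patterns are form 1).
print(evaluation_scores(175, 2, 18, 147))   # ~ (98.9, 92.9, 88.6)
```
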

Note again that the default (automatic) algorithm output was used in these tests, and it is entirely possible that better groupings could have been obtained by taking a different cut of the dendrograms manually.

Discussion. As seen in Tables 1-4, the overall performance of the new algorithm is very good. The normalized match percentage ranged from 96% to 100% (avg 98.7%) when no heuristics were used, or from 89% to 100% (avg 96.6%) when heuristics were used. The work performed ranged from 64% to 100% (avg 83.9%) when no heuristics were used, or from 82% to 100% (avg 91.5%) when heuristics were used. The use of heuristics significantly reduces the number of extra clusters (N) generated by the algorithm, from a total of 143 (17.3%) in Tables 3 and 4 to 67 (8.1%) in Tables 1 and 2. It does raise the number of mismatched patterns (M), from a total of six (0.7%) in Tables 3 and 4 to eleven (1.3%) in Tables 1 and 2.

Of the 17 mismatched patterns in Tables 1-4, most (10 cases) occurred because mixtures were matched with one of the parent forms. Mixtures sometimes have no more than three peaks from the other form(s) present and therefore would not be distinguishable from the dominant parent form under the default algorithm settings (though that could change if the user reduced the number of missing peaks allowed in the similarity metric). Thus, the performance with respect to the M metric is likely even better than the above data suggest. The remaining seven mismatched patterns were due to peak detection failures or misapplied heuristics (five patterns) and failure to identify amorphous patterns (two patterns); such failures are expected when a broad range of data from different instruments is considered.

It is clear from Tables 1-4 that the proposed algorithm is biased toward producing extra clusters rather than mismatching patterns, which was the intent of the design, as mentioned in the Introduction. Not surprisingly, the average-link method significantly outperforms the complete-link method with respect to the number of extra clusters produced, 85 (10.3%) to 125 (15.1%) in total. Perhaps more surprisingly, the average-link method performs roughly in line with the complete-link method with respect to mismatched patterns, with totals of nine (1.1%) and eight (1.0%), respectively. In fact, if the user is willing to tolerate the slight increase in the number of mismatched patterns, the average-link method with heuristics (Table 2) outperformed or tied the other configurations for all data sets, leading us to believe it is the best overall configuration for this algorithm.

Looking at specific data sets with the best configuration, the algorithm was able to handle even the most difficult set (set 4) well, clustering 86% of the patterns with no mismatches, while the numbers were above 90% for the other five sets. Since data set 4 contained about 10 patterns with heavy PO/poor particle statistics in which families of peaks were missing, it would be difficult to improve on that score with an algorithm based on matching peaks. In general, the algorithm demonstrated that it can handle most XRPD artifacts, though it is most likely to fail to cluster patterns containing heavy PO/poor particle statistics (although it does not mismatch them). Matching of amorphous forms could also be improved, based on the results from data set 3.
Finally, we compare the results from the new algorithm with those of the commonly used Rwp algorithm on the same data sets. The normalized match scores for Rwp ranged from 75% to 95% (avg 85.0%) for the complete-link clustering metric, or from 56% to 95% (avg 78.4%) for the average-link metric. The work performed ranged from 63% to 87% (avg 73.7%) for the complete-link clustering metric, or from 56% to 96% (avg 75.3%) for the average-link metric.

We can see that the best configuration of the new algorithm significantly outperforms the best configuration of Rwp, both in terms of accuracy (avg 98.7% vs 85.0%) and in terms of the overall work performed (avg 91.5% vs 75.3%). One can also observe that the new algorithm is much more balanced in its performance across a variety of data sets, whereas Rwp's overall performance was hurt by extremely poor performance on data sets 2 and 4.

Conclusion

This paper introduces a novel pattern matching algorithm optimized for X-ray powder diffraction data. The algorithm is based on hierarchical cluster analysis with a similarity metric built around the comparison of peak positions. It includes heuristics designed to improve performance in commonly encountered cases and a robust peak detection algorithm that assigns peaks to discrete groups based on the probability that they are actual peaks rather than noise. The performance of the algorithm was evaluated on six different data sets and shown to be comparable to that of a trained human expert (though still somewhat worse on average).

Future work on this algorithm will include optimizing its use in matching other data, such as Raman and IR data. The current algorithm has already been successfully applied to such data, but unique heuristics could be developed to further improve accuracy. Additional work on XRPD data includes improving the accuracy of the algorithm to achieve concurrently a lower level of mismatches and a lower level of extra clusters. In a parallel software development, we are able to rapidly obtain crystallographic structural information from powder data. We are now working to incorporate structural data into the pattern matching algorithm to provide a level of accuracy that is not obtainable through a peak matching algorithm alone (potentially higher than that of a trained human expert).

Acknowledgment. The authors thank Patricia Mougin, Rex Shipplett, and Pat Stahly for providing the data used in the evaluation of the algorithm.

Supporting Information Available: Dendrograms for data sets 2-5 for the new and Rwp algorithms, for comparison purposes, and the entire data set 2 in ASCII x-y format, to illustrate the data used in testing and to allow readers to compare their pattern matching implementations with ours. This material is available free of charge via the Internet at http://pubs.acs.org.

References and Notes

(1) Hertzberg, R. P.; Pope, A. J. High-throughput screening: new technology for the 21st century. Curr. Opin. Chem. Biol. 2000, 4, 445-451.
(2) Johnston, P. A. Cellular platforms for HTS: three case studies. Drug Discov. Today 2002, 7, 353-363.
(3) Barberis, A. Cell-Based High-Throughput Screens for Drug Discovery. Eur. Biopharmaceut. Rev. 2002, Winter.
(4) Marquart, R. G.; Katsnelson, I.; Milne, G. W. A.; Heller, S. R.; Johnson, G. G., Jr.; Jenkins, R. A Search-Match System for X-ray Powder Diffraction Data. J. Appl. Crystallogr. 1979, 12, 629-634.
(5) Gurley, K.; Kijewski, T.; Kareem, A. First- and Higher-Order Correlation Detection Using Wavelet Transforms. J. Eng. Mech. 2003, 188-201.
(6) Gilmore, C. J.; Barr, G.; Paisley, J. High-throughput powder diffraction. I. A new approach to qualitative and quantitative powder diffraction pattern analysis using full pattern profiles. J. Appl. Crystallogr. 2004, 37, 231-242.
(7) Johnson, S. C. Hierarchical Clustering Schemes. Psychometrika 1967, 2, 241-254.
(8) Borgatti, S. P. How to Explain Hierarchical Clustering. Connections 1994, 17 (2), 78-80.
(9) Savitzky, A.; Golay, M. J. E. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Anal. Chem. 1964, 36, 1627-1639.
(10) Hamming, R. W. Digital Filters, 2nd ed.; Prentice Hall: Englewood Cliffs, NJ, 1983.

(11) Olson, C. Parallel Algorithms for Hierarchical Clustering. Parallel Computing 1995, 21, 1313-1325.
(12) Young, R. A. The Rietveld Method; Oxford University Press: Oxford, 1993; p 21.

(13) De Gelder, R.; Wehrens, R.; Hageman, J. A. A Generalized Expression for the Similarity of Spectra: Application to Powder Diffraction Pattern Classification. J. Comput. Chem. 2001, 22 (3), 273-289.