Application of Fast Fourier Transform Cross-Correlation for the

generation of large data sets that require analysis. To assist accurate comparison of chemical signals, we pro- pose two methods for the alignment of ...
0 downloads 0 Views 265KB Size
Anal. Chem. 2005, 77, 5655-5661

Application of Fast Fourier Transform Cross-Correlation for the Alignment of Large Chromatographic and Spectral Datasets Jason W. H. Wong,*,† Caterina Durante,‡ and Hugh M. Cartwright†

Physical and Theoretical Chemistry Laboratory, Chemistry Department, Oxford University, South Parks Road, Oxford OX1 3QZ, England, and Dipartimento di Chimica, Universita` di Modena e Reggio Emilia, Via Campi 183, 41100 Modena, Italy

Preprocessing of chromatographic and spectral data is an important aspect of analytical sciences. In particular, recent advances in proteomics have resulted in the generation of large data sets that require analysis. To assist accurate comparison of chemical signals, we propose two methods for the alignment of multiple spectral data sets. Based on methods previously described, each chromatograph or spectrum to be aligned is divided and aligned as individual segments to a reference. However, our methods make use of fast Fourier transform for the rapid computation of a cross-correlation function that enables alignments between samples to be optimized. The proposed methods are demonstrated in comparison with an existing method on a chromatographic and a mass spectral data set. It is shown that our methods provide an advantage of speed and a reduction of the number of input parameters required. The software implementations for the proposed alignment methods are available under the downloads section at http://ptcl.chem.ox.ac.uk/ ∼jwong/specalign. Proteomics experiments in recent years have led to development of several “profiling” technologies whereby large chromatographic and mass spectral data sets are rapidly acquired from numerous biological samples. A frequent objective of the analysis of such data sets is the detection of signal differences due to chemical variation from sample to sample. Examples include the use of mass spectrometry (MS), to acquire spectral files representing intact or fragmented proteins and peptides.1-3 Spectral profiles of protein peaks obtained from multiple tissue samples using matrix-assisted laser desorption/ionization or the related method of surface-enhanced laser desorption/ionization (SELDI)MS are able to generate, within one experiment, data sets that are often of a very large scale.4,5 These data sets represent * Corresponding author. E-mail: [email protected]. Fax: +44 1865 275410. † Oxford University. ‡ Universita` di Modena e Reggio Emilia. (1) Hilario, M.; Kalousis, A.; Prados, J.; Binz, P. A. Drug Discovery Today Biosilico 2004, 2, 214-222. (2) Diamandis, E. P. Mol. Cell. Proteomics 2004, 3, 367-378. (3) Nescizhskii, A. I.; Aebersold, R. Drug Discovery Today 2004, 9, 173-181. (4) Aebersold, R.; Mann, M. Nature 2003, 198-207. (5) Yates, J. R. Annu. Rev. Biophys. Biomol. Struct. 2004, 33, 297-316. 10.1021/ac050619p CCC: $30.25 Published on Web 08/04/2005

© 2005 American Chemical Society

expression profiles of the relevant tissues and are being used to search for biomarkers that may serve as early indicators of disease.1,2 However, analysis of such data sets is rarely straightforward. An important consideration is the time required for analysis of large data sets. For instance, a typical data set for biomarker discovery may contain more than 200 spectra, each of which may contain tens of thousand of data points. Preprocessing of these data sets to enable meaningful comparisons between samples using chemometric or machine learning methods is also often required. A particularly significant and interesting aspect of data preprocessing is the alignment or synchronization of data signals. Misalignment may arise when spectra obtained from different instruments are compared. In fact, small run-to-run variations in the retention time, mass-to-charge ratio, etc., for any chemical are almost unavoidable. The alignment of chemical data has been the subject of a number of studies in recent years ranging from applications to chromatography,6-9 nuclear magnetic resonance spectroscopy,10-13 Raman spectroscopy,14 and MS.15,16 The majority of methods that have been applied to chromatographic data sets involve the alignment of signals through interpolation and transformation by means of warping the data using techniques such as dynamic time warping and correlation optimized warping.8,9 These algorithms employ dynamic programming that require O(N2) in time and memory, meaning that the alignment of spectra containing 20 000 data points (N) would require more (6) Eilers, P. H. C. Anal. Chem. 2004, 76, 404-411. (7) Johnson, K. J.; Wright, B. W.; Jarman, K. H.; Synovec, R. E. J. Chromatogr., A 2003, 996, 141-155. (8) Tomasi, G.; Van Den Berg, F.; Andersson, C. J. Chemom. 2004, 18, 231241. (9) Bylund, D.; Danielsson, R.; Malmquist, G.; Markides, K. E. J. Chromatogr., A 2002, 961, 237-244. (10) Forshed, J.; Andersson, F. O.; Jacobsson, S. P. J. Pharm. Biomed. Anal. 2002, 29, 495-505. (11) Forshed, J.; Schuppe-Koistinen, I.; Jacobsson, S. P. Anal. Chim. Acta 2003, 487, 189-199. (12) Lee, G. C.; Woodruff, D. L. Anal. Chim. Acta 2004, 513, 413-416. (13) Torgrip, R. J. O.; Aberg, M.; Karlberg, B.; Jacobsson, S. P. J. Chemom. 2003, 17, 573-582. (14) Witjes, H.; van den Brink, M.; Melssen, W. J.; Buydens, L. M. Chemom. Intell. Lab. 2000, 52, 105-116. (15) Wong, J. W. H.; Cagney, G.; Cartwright, H. M. Bioinformatics 2005, 21, 2088-2090. (16) Yasui, Y.; McLerran, D.; Adam, B. L.; Winget, M.; Thornquist, M.; Feng, Z. J. Biomed. Biotechnol. 2003, 2003, 242-248.

Analytical Chemistry, Vol. 77, No. 17, September 1, 2005 5655

than 1 gigabyte of memory (200002 × 32 bits). Thus, such methods are generally not suitable for the alignment of large data sets on standard personal computers. An alternative method is by alignment of signals through insertion and deletion of data points.10,12,13,15 While at risk of introducing artifacts or removing important data points, these latter methods are typically faster and are more relevant to analytical problems where it is imperative that peaks are not skewed. Thus far, a number of methods have addressed the problem of aligning large data sets where the time required for analysis can become a limiting factor.10-13,15,16 One class of algorithms attempts to speed up alignment by reducing the number of data points through peak picking.7,13,15,16 While these methods have been shown to work well on a variety of data sets, the requirement to use peak picking prior to alignment may result in misalignment of small but important peaks. In addition, in regions of the spectrum where peaks are dense, it is possible for a peak to be misaligned to due to overlapping of sample peak signals with unrelated peak signals in the reference. An alternative approach using correlation has been developed that does not require preselection of peaks.11,12 The heuristics of the method lies in first dividing a spectrum or chromatograph into a specified segment size, before applying the correlation coefficient to determine the optimal shift position for each segment against a reference. Each segment is adjusted appropriately through insertion and deletion of data points at segment boundaries, linear interpolation, or both before they are reassembled. Forshed et al.11 have implemented a genetic algorithm to find the optimal shift position that allows for some linear interpolation, while, based on the same approach, Lee and Wooduff12 implemented the heuristic beam search algorithm to find the shift position without interpolation (peak alignment by beam searching (PABS)). Despite the elimination of the necessity to pick peaks for alignment, the segment correlation methods still require that the user select an appropriate minimum segment size for alignment. While the minimum segment size can be estimated by visual inspection of the approximate signal region widths in the sequential segmentation model, this may not necessarily result in optimum alignment. An excessively large segment may neglect the alignment of smaller signals, while a very small segment size may introduce artifacts due to the insertion or deletion of data points. Considering both efficiency and effectiveness, it is desirable to develop an approach where spectral alignment can be performed with minimal human intervention. In this paper, making use of fast Fourier transform (FFT) crosscorrelation, we propose two algorithms that provide advantages in speed and accuracy over previous methods and also eliminate the requirement for alignment parameters. There are two significant advantages of the FFT cross-correlation function. First, it is perhaps the fastest method for shift estimation in signal processing. Second, the FFT cross-correlation function is not heuristic and the search result is optimal. In the following section, we briefly describe the theory behind FFT cross-correlation, describe two alignment algorithms using FFT cross-correlation, and finally demonstrate the advantages of our algorithm on a gas chromatographic (GC) and a SELDI-MS data set by making comparisons with an existing method. 5656

Analytical Chemistry, Vol. 77, No. 17, September 1, 2005

THEORY AND METHODS Fast Fourier Transform Cross-Correlation. The calculation of the cross-correlation function using FFT is a well-known method for measuring correlation and time delay or shift between 1D and 2D signals in fields such as audio and image analysis. In the scientific realm, FFT cross-correlation has been used in a variety of applications, including tandem mass spectra database searching as in the SEQUEST algorithm.17 FFT cross-correlation has the attractive property of enabling rapid calculation of the correlation between two data sets where one signal may be shifted relative to another. In addition, perhaps more significantly for the application to the spectral alignment problem, it is also able to rapidly provide an accurate estimation of the shift between two data sets. We now describe the calculation of the correlation and shift using FFT. First, consider two functions, r(x) and s(x), at any shift position, u; their cross-correlation function, denoted Corr(r,s)u, is defined by

Corr(r,s)u )



∞ -∞r(x)s(x

+ u) dx

(1)

Further, the Fourier transform for any function h(x) is given by

H(χ) ) h(x) )





∞ 2πiχx -∞h(x)e

dx

∞ -2πiχx dχ -∞H(χ)e

(2)

where H(χ) is the Fourier transformed function in the inverse wavelength χ domain. For simplicity, a forward and reverse Fourier transform is typically denoted as

h(x) S H(χ)

(3)

Now, the discrete correlation theorem states that

Corr(r,s)u S R(χ)S*(χ)

(4)

where S* represents the complex conjugate of the function. Therefore, by applying forward Fourier transforms to r(x) and s(x), then multiplying the transformed functions R(χ) and S*(χ), and then finally performing reverse Fourier transform of this product will generate a cross-correlation function between r(x) and s(x). The optimal shift, uop, between r(x) and s(x) can be found at the maximums of Corr(r,s)u. If r(x) ) s(x), then the maximal correlation will be at uop ) 0. FFT Correlation Spectral Alignment Algorithm. One of the inherent difficulties with the alignment of spectral data is that the shifts in chemical signals are not necessarily uniform across the spectrum. As a result, it is necessary to divide the spectrum into an arbitrary number of segments such that the shift in each signal can be corrected independently. Forshed et al.11 proposed a model in which a spectrum is segmented sequentially according a minimum segment size parameter. Lee and Wooduff12 made use of this segmentation method and applied a beam search algorithm for the determination of the shift between the two spectral segments that gives the greatest correlation using Pearson’s (17) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976.

correlation coefficient. The beam search is a heuristic search method based on the breath-first search. However, rather than calculating the correlation coefficient at all shift positions, as would be the case using breath-first search, beam search only selects the best k shift positions (beams) to evaluate at each search level. While this search method would find the candidate optimal shift position more quickly than a breath-first search, it leaves the possibility for returning a suboptimal solution. Our first contribution makes use of the same spectra segmentation model, but in order to improve the speed and accuracy of the algorithm, we apply FFT cross-correlation (4) for the shift determination. For simplicity we will refer to our algorithm as peak alignment by FFT (PAFFT). Our second alignment algorithm is also based on spectral segmentation; however it has been developed with the aim of eliminating the requirement of parameters by automatically determining the minimal segment size for alignment. This is achieved by recursive alignment from the full spectrum (global scale) to progressively smaller segments (local scale) until no further alignment is required. The main advantage of our approach is that, by first aligning spectra on a global scale, the amount of alignment required on a local scale can be reduced. At the same time, the algorithm will automatically locate and align local areas that require particular attention. The algorithm can be described in four separate functions. The first step calculates the cross-correlation function between a segment and its corresponding reference segment as describe in (4) using FFT. The optimal shift position, uop, is then found by

uop ) maxu(Corr(r,s)u)

(5)

Once the optimal shift size is found by using (5), the segment is then shifted by that amount using

s′(x) )

{

s(0) x∈[0,uop - 1], s(x) x∈[uop,N - uop - 1] if uop > 0 s(x) x∈[uop,N - 1], s(N) x∈[N,N + uop - 1] if uop < 0 if uop ) 0 s(x) x∈[0,N - 1] (6)

where N is the size of the segment and s′(x) is the shifted segment. To enable a recursive alignment approach of progressing from alignment of larger segments to smaller segments, the next process is to divide a segment and its corresponding reference segment into two subsegments at position, p, such that

p ) min(s(x)) x∈[N/4, 3N/4]

(7)

the divided segments, lefts and rights can then be defined as

lefts ) s(x) x∈[0, p - 1] rights ) s(x) x∈[ p, N - 1]

(8)

the corresponding reference segments, leftr and rightr are computed similarly.

Figure 1. PAFFT algorithm in pseudocode.

The recursive alignment function, R, can now be expressed as follows

R(s(x),r(x)) )

{

if N < 10 or uop(s) ) 0 s(x) [R(lefts,leftr) ∼ R(rights,rightr)] if N g 10 and uop(s) * 0 (9)

where lefts, leftr, rights, and rightr are calculated from s(x) by applying eqs 5-8 sequentially. Note that the symbol “∼” denotes the concatenation of the segments. The recursion is set to stop when the segment size falls below 10 points to prevent the algorithm from unnecessarily aligning small segments of noise. We will refer to the above algorithm as recursive alignment by FFT (RAFFT). The pseudocodes for PAFFT and RAFFT are provided in Figure 1 and Figure 2, respectively. Implementation. The algorithms were implemented in both Matlab and C++ and are available from our SpecAlign website (http://ptcl.chem.ox.ac.uk/∼jwong/specalign) under the downloads section. There are a number of optional parameters that have been made available to the user to allow a degree of control of the algorithm. First, while the cross-correlation function (4) spans the range of the functions r and s, it is not necessarily practical to accept shifts that are overly large. In situations where the spectrum to be aligned has reasonable resemblance to the reference spectrum, the problem of overshifting should not occur. Nevertheless, a parameter has been made available to restrict the maximum shift allowed for each segment. Second, as the recursive alignment algorithm can be halted prematurely if a spectrum is globally well aligned, a tolerance parameter is also made available to enable the recursion to “look ahead” for local alignment abnormalities. Illustrative Examples. The algorithms were tested using two different data sets. The first example demonstrates the alignment algorithms on a GC data set. The data set consist of 36 samples, Analytical Chemistry, Vol. 77, No. 17, September 1, 2005

5657

reference spectrum and the sample spectra before and after alignment was compared. This was performed by selecting the 10 highest intensity peaks in each sample spectrum and comparing their peak maximums to the corresponding peak in the reference spectrum. If none was found within a range of (50 data points, the peak was deemed to have no corresponding peak and was disregarded. The advantage of using relative peak shift as a measure of the quality of alignment in comparison to using correlation is that result will not be dependent on peak size.

Figure 2. RAFFT algorithm in pseudocode.

each with 9451 data points. The average spectrum of the 36 samples was used as the reference for alignment in all cases. For the second example, a more challenging (more peaks) and larger SELDI-MS data set was used to test the algorithms. The data set consist of 256 samples, each containing 17 669 data points. Again, the average spectrum of all 256 samples was used as the reference (Figure 3). Matlab implementations of PAFFT and RAFFT algorithms were used to align the data sets. In addition, the Matlab implementation of PABS was also tested as a comparison. All runs were carried out on a standard personal computer (CPU-1Ghz, RAM-256Mb). Unless otherwise specified, as parameters, both PAFFT and PABS were set to have a minimum segment size of 500, and all algorithms were set to enable a maximum shift of no more than (20 points. Both parameters were predetermined by optimizing values that maximize the alignment of the data set. The CPU time required for the alignment of the data set was measured for each method using the “cputime” function Matlab. As it is difficult to qualitatively show the level of improvement in alignment across the spectrum for all samples, the performances of the algorithms are demonstrated quantitatively. The average overall improvement in shift for the peak maximums in the

RESULTS We first demonstrate the effect of an alignment algorithm qualitatively. Figure 4 demonstrates the alignment of a particular set of peaks in a sample of the GC data set. Before alignment (A), it can be seen that peaks are not synchronized. Following alignment (B), the algorithm appropriately shifts the peaks to match the reference. The alignment of the GC data set was quantitatively measured by the average peak shift from the reference chromatogram (Table 1). It can be seen that, as expected, all alignment methods significantly reduced the average shift of peaks for the data set compared to the reference. The improvements were broadly similar, with PAFFT resulting in the best improvement in peak alignment. PABS performed marginally better than RAFFT. In terms of run time, PAFFT was the fastest followed by RAFFT and PABS. In alignment of the SELDI-MS data set, the PAFFT algorithm again performed the best (Table 1). As with the GC data set, the improvement in average shift was similar for the three methods, with RAFFT performing better than PABS. The run times show that the FFT cross-correlation based algorithms are fastest and are particularly effective in the alignment of larger data sets. To demonstrate the importance of selecting an appropriate segment size parameter for the PABS and PAFFT algorithms, the alignment was run on the SELDI-MS data set using varying segment sizes. Figure 5 shows that the minimum segment size that led to the greatest improvement in peak shift following alignment is 500 points. With segment sizes smaller and larger than 500 points, the improvement decreases. So, as expected, the choice of minimum segment size is significant and influences the accuracy of alignment. Further, with smaller segment sizes, there is an increased risk of introducing alignment artifacts (i.e., peak

Figure 3. Average spectrum of the SELDI-MS data sets. The spectrum is an average of 256 spectra each containing 17 669 data points. 5658

Analytical Chemistry, Vol. 77, No. 17, September 1, 2005

Figure 4. GC data set before (A) and after (B) the application of the RAFFT alignment algorithm. In both figures, the boldface line indicates the average spectrum used as the reference for the alignment. The inset figure shows the full chromatograph where the shaded area highlights the region zoomed in.

shifts that are not necessarily correct) (Figure 6). It can also be seen from Figure 6, that, using RAFFT, no artifact was introduced for the same segment. DISCUSSION The most significant benefit of using FFT cross-correlation for the alignment of large spectral data sets is in the improvement in

speed over existing methods. The results presented here (Table 1) indicated that the improvement in the running time of alignment using PAFFT over PABS is considerable (∼5 times), despite the fact that these algorithms only differ by the optimal shift position calculation step. While, for each segment of size N, the computational complexity in segment shift determination is in fact identical by FFT cross-correlation and beam search (when one Analytical Chemistry, Vol. 77, No. 17, September 1, 2005

5659

Table 1. Results for the Average Shift of the Ten Most Intense Peaks in Each Spectrum Compared to the Reference before Alignment and after Alignment Using Each Algorithm for the GC and SELDI-MS Data Setsa GC data set algorithm none RAFFT PAFFT PABS

average shift (pts)

CPU time (s)

6.831 ( 3.021 3.764 ( 1.587 4.88 ( 0.10 3.239 ( 1.354 1.59 ( 0.05 3.714 ( 1.442 8.10 ( 0.12

SELDI-MS data set average shift (pts)

CPU time (s)

5.625 ( 2.603 2.656 ( 1.530 71.52 ( 0.91 2.522 ( 1.482 22.89 ( 0.46 2.709 ( 1.575 109.76 ( 1.65

a The CPU times are an average of three runs as measured in Matlab.

Figure 6. Segment of a sample of the SELDI-MS data set showing an alignment artifact introduced in the process of alignment by the PAFFT algorithm using a minimum segment size of 200 points. The lines are as follows: dark bold, reference; solid, unaligned; dotted, alignment by PAFFT; dashed, alignment by RAFFT.

Figure 5. Average peak shift improvement after alignment as a function of minimum segment size on the SELDI-MS data set. The points marked by diamonds indicate the performance of alignment by the PAFFT algorithm at varying minimum segment sizes. The broken line shows the performance of the alignment by RAFFT. Note that the RAFFT algorithm runs independent of the minimum segment size.

beam is used) at O(N log2N). Yet, as reflected in results in Table 1, FFT cross-correlation is significantly faster than beam search, because in practice, the calculation of Pearson’s correlation coefficient used in beam search has greater computational overhead due to the requirement to compute divisions and square roots. The effect of this overhead becomes increasingly significant with large values of N and the number of spectra to be aligned. RAFFT is based on a recursive segmentation method and therefore requires the calculation of cross-correlation for larger segments of spectra initially; consequently, it is slower than PAFFT even though examination of the number of executions of the FFT cross-correlation function reveals that RAFFT attempts to align fewer segments compared with PAFFT when using a minimum segment size of 500 points. Despite this, RAFFT still shows a speed advantage over PABS, again demonstrating the speed advantage of FFT. A second benefit of FFT cross-correlation is that, unlike the beam search algorithm, it finds the optimal spectral shift from the calculation of the complete cross-correlation function. Beam search is a heuristic method, and although in most cases is also able to find the optimal solution, it may sometimes return a 5660 Analytical Chemistry, Vol. 77, No. 17, September 1, 2005

suboptimal result. This is the likely explanation for PAFFT returning better average peak shift improvements than PABS (Table 1). Based on our quantitative measure of alignment, it might be initially slightly surprising to see that, following alignment, the average peak shift discrepancy from the reference is still between two and three data points. The algorithm alignments used here represent segment correlation methods rather than peak alignment methods based on aligning peak maximums. As correlation is dependent on peak shape, in most cases, even following alignment, it is possible that the peak maximum does not occur on the same data point as the reference. RAFFT makes use of a recursive segmentation model to eliminate the need to estimate a minimal segmentation size before performing alignment. Moreover, it was demonstrated that the choice of minimum segment size is not trivial (Figure 5) and choosing too small a size may result in the introduction of alignment artifacts (Figure 6). Starting from aligning a spectrum from a global scale, the recursive segmentation model is able to automatically choose local segments that require further alignment. However, it is possible under recursive alignment that a spectrum may be well aligned globally and yet contains small local segments that are unaligned. In this case, the recursive method may cease prematurely. To tackle this potential problem, we provide an optional parameter in our implementation that enables the recursive algorithm to “look ahead” a specified number of iterations for local segments that may not be well aligned. As mentioned earlier a number of other alignment algorithms exist. However, warping methods such as dynamic time warping, correlation time warping, and parametric time warping are not directly comparable to the methods described here because they are unsuitable to align data sets as large as those tested here due to memory and time requirements. Another class of alignment algorithms, peak alignment-based algorithms, are more rapid as they align only peaks, not all data points. Peak-based methods should be more accurate than correlation-based methods if the proceeding data analysis method is peak intensity based. However,

peak-based methods are unable to correctly align dense peak regions where peaks in the sample may overlap with those in the reference. Because the FFT cross-correlation methods proposed here are not based on peak information, they may be used as a prealignment step to make initial corrections to alignment before applying a peak-based alignment method. We provide this option in SpecAlign, a spectral data preprocessing software tool developed in our laboratory.15 In this paper, we have not provided results for direct comparison between peak-based methods with the FFT methods outlined in this paper. However, we note that, from test results not shown here, in terms of speed, the FFT crosscorrelation based algorithms are broadly comparable with the peak-based algorithm implemented in SpecAlign. CONCLUSION The run-time advantage of FFT for the calculation of the crosscorrelation function has been exploited in the development of PAFFT and RAFFT algorithms for the alignment of large spectral data sets. The examples presented here demonstrate that the

methods have considerable advantages over the existing PABS method. PAFFT resulted in marginally better alignment and a significant improvement in speed. Meanwhile, the RAFFT algorithm provides the benefit of eliminating the need for user selection of a minimum segment size before alignment. These methods are anticipated to address the need for automated, fast, and reliable data preprocessing methods that are becoming increasingly important in the high-throughput analysis of large spectral data sets in proteomics experiments. ACKNOWLEDGMENT Financial support from the Sir Vincent Fairfax Oxfords Australia Scholarship and the Overseas Research Scheme (U.K.) to J.W.H.W is gratefully acknowledged.

Received for review April 12, 2005. Accepted July 6, 2005. AC050619P

Analytical Chemistry, Vol. 77, No. 17, September 1, 2005

5661