Anal. Chem. 2009, 81, 3079–3086

MetAlign: Interface-Driven, Versatile Metabolomics Tool for Hyphenated Full-Scan Mass Spectrometry Data Preprocessing

Arjen Lommen*

RIKILT-Institute of Food Safety, Wageningen UR, P.O. Box 230, 6700 AE Wageningen, The Netherlands

Hyphenated full-scan MS technology creates large amounts of data. A versatile, easy-to-handle automation tool that aids in the data analysis is very important in handling such a data stream. MetAlign software, as described in this manuscript, handles a broad range of accurate mass and nominal mass GC/MS and LC/MS data. It is capable of automatic format conversions, accurate mass calculations, baseline corrections, peak-picking, saturation and mass-peak artifact filtering, as well as alignment of up to 1000 data sets. A 100- to 1000-fold data reduction is achieved. MetAlign software output is compatible with most multivariate statistics programs.

One of the key issues in MS (mass spectrometry)-based metabolomics is the analysis of the enormous amount of data. For this to be feasible, a high degree of automation in data analysis is necessary. A number of tools for preprocessing, alignment, and identification have been proposed in the literature, such as MzMine,1 Binbase,2 XCMS,3 MathDAMP,4 MetaQuant,5 METIDEA,6 and Tagfinder.7 Besides these tools, MS manufacturers have made commercial tools available, such as Markerlynx (Waters) and Sieve (Thermo Fisher Scientific). All have been shown to be powerful tools.
No application, however, can combine in automation the following: (a) easy use of both GC/MS (gas chromatography mass spectrometry) and LC/MS (liquid chromatography mass spectrometry) data, (b) direct conversion to and from manufacturer formats, as well as netCDF (network Common Data Form), (c) preprocessing (baseline correction, denoising, accurate mass calculation) and export of the result to manufacturer formats for visual inspection, (d) alignment at low mass and high mass resolutions, (e) export of univariate statistical selections to differential MS data files, (f) export to spreadsheets for multivariate statistical analysis, and (g) conversion of a multivariate statistical selection to an MS data file. The newest metAlign version introduced here combines these aspects.

The metAlign software has been in development for over 8 years and has been used in several publications. The basis of nearly all metAlign algorithms is derived from the way a trained expert would analyze the data by eye and hand. This implies that no major published mathematical algorithms are used.

A dilemma in software development and publication is that fast publication of new complex software does not always mean that the software is fully tested. To be able to validate software with broad capabilities such as metAlign, several sets of data files from different origins are needed. In the past, the choice was made to first validate the software with a number of applications, which led to publications, before publication of the software itself.

The first version of metAlign was adapted from an NMR (nuclear magnetic resonance) preprocessing and alignment tool developed more than 10 years ago.8,9 The adaptation toward LC/MS and GC/MS data resulted in a first publication in 2003, followed by further applications from then onward.10-18 This version of metAlign could work with GC/MS as well as LC/MS data through conversion to nominal mass data. Starting in 2007, a new, more elaborate metAlign version was developed, which can calculate and fully use the accurate mass in data reduction and for alignment. This manuscript describes the algorithms behind metAlign. A free download of the software, manual, and additional tips is provided at www.metalign.nl.

* To whom correspondence should be addressed. E-mail: arjen.lommen@wur.nl.
(1) Katajamaa, M.; Oresic, M. BMC Bioinf. 2005, 6, 179–91.
(2) Fiehn, O.; Wohlgemuth, G.; Scholz, M. Proc. Lect. Notes Bioinf. 2005, 3615, 224–239.
(3) Smith, C. A.; Want, E. J.; O’Maille, G.; Abagyan, R.; Siuzdak, G. Anal. Chem. 2006, 78, 779–787.
(4) Baran, R.; Kochi, H.; Saito, N.; Suematsu, M.; Soga, T.; Nishioka, T.; Robert, M.; Tomita, M. BMC Bioinf. 2006, 7, 530.
(5) Bunk, B.; Kucklick, M.; Jonas, R.; Münch, R.; Schobert, M.; Jahn, D.; Hiller, K. Bioinformatics 2006, 22, 2962–2965.
(6) Broeckling, C. D.; Reddy, I. R.; Duran, A. L.; Zhao, X.; Sumner, L. W. Anal. Chem. 2006, 78, 4334–4341.
(7) Luedemann, A.; Strassburg, K.; Erban, A.; Kopka, J. Bioinformatics 2008, 24, 732–737.
(8) Lommen, A.; Weseman, J. M.; Smith, G. O.; Noteborn, H. P. J. M. Biodegradation 1998, 9, 513–525.
(9) Noteborn, H. P. J. M.; Lommen, A.; van der Jagt, R. C.; Weseman, J. M. J. Biotechnol. 2000, 77, 103–114.
(10) Tolstikov, V. V.; Lommen, A.; Nakanishi, K.; Tanaka, N.; Fiehn, O. Anal. Chem. 2003, 75, 6737–6740.
(11) Vorst, O.; de Vos, C. H. R.; Lommen, A.; Staps, R. V.; Visser, R. G. F.; Bino, R. J.; Hall, R. D. Metabolomics 2005, 1, 169–180.
(12) Tikunov, Y.; Lommen, A.; de Vos, C. H. R.; Verhoeven, H. A.; Bino, R. J.; Hall, R. D.; Lindhout, P.; Bovy, A. G. Plant Physiol. 2005, 139, 1125–1137.
(13) America, A. H. P.; Cordewener, J. H. G.; Van Geffen, H. A.; Lommen, A.; Vissers, J. P. C.; Bino, R. J.; Hall, R. D. Proteomics 2006, 6, 641–653.
(14) Keurentjes, J. J. B.; Jingyuan, F.; de Vos, C. H. R.; Lommen, A.; Hall, R. D.; Bino, R. J.; van der Plas, L. H. W.; Jansen, R. C.; Vreugdenhil, D.; Koornneef, M. Nat. Genet. 2006, 38, 842–849.
(15) Lommen, A.; van der Weg, G.; van Engelen, M. C.; Bor, G.; Hoogenboom, L. A. P.; Nielen, M. W. F. Anal. Chim. Acta 2007, 584, 43–49.
(16) de Vos, C. H. R.; Moco, S.; Lommen, A.; Keurentjes, J. J. B.; Bino, R. J.; Hall, R. D. Nat. Protoc. 2007, 2, 778–791.
(17) Ducruix, C.; Vailhen, D.; Werner, E.; Fievet, J. B.; Bourguignon, J.; Tabet, J.-C.; Ezan, E.; Junot, C. Chemom. Intell. Lab. Syst. 2008, 91, 67–67.
(18) Matsuda, F.; Yonekura-Sakakibara, K.; Niida, R.; Kuromori, T.; Shinozaki, K.; Saito, K. Plant J. 2009, 57, 555–577.

10.1021/ac900036d CCC: $40.75 © 2009 American Chemical Society. Published on Web 03/20/2009

Analytical Chemistry, Vol. 81, No. 8, April 15, 2009


Table 1. Indications for Data Processing Benchmarks (a)

chromatography type        GC         GC         GC         UPLC       UPLC       HPLC
type of MS                 HP-MSD     TOF        TOF        TOF        TOF        Orbitrap
vendor                     HP/Agil.   Leco       Leco       Waters     Waters     Thermo-F.
mass range                 70-450     70-600     70-600     100-1500   80-1500    50-1000
mass resolution            nominal    nominal    nominal    10000      10000      100000
number of scans            1963       36880      37000      1050       2600       1430
number of samples          8          12         940        23         550        8
mean original file size    1.6 MB     114 MB     117 MB     162 MB     205 MB     51 MB
mean reduced file size     106 kB     950 kB     940 kB     160 kB     180 kB     260 kB

Running Times Per Batch
baseline correction        70 s       52 min     66 h       159 min    70 h       75 min
rough alignment            5 s        50 s       n.d.       176 s      n.d.       92 s
iterative alignment        20 s       256 s      91 h       10 min     92 h       7 min

(a) All examples were run on an Intel P4 3 GHz processor with 1.5 GB of memory at thresholds of approximately 2 times noise, using the workflow of Figure 1. Running times are for the number of files indicated by the number of samples. Baseline corrections include format conversions, smoothing, denoising, filtering, and peak-picking. Rough and iterative alignments are as indicated in the text.

EXPERIMENTAL SECTION

Data files used for the development of metAlign were from different sources and used with permission as examples. The scientific relevance of the data files is beyond the scope of this manuscript.

Conversion of Masslynx format, Xcalibur format, netCDF, and the old-style HP/Agilent format is implemented in metAlign. The first two formats require installation of the manufacturer’s software. The Masslynx format is accessed by conversion to netCDF through in-line use of the Masslynx conversion tool, Dbridge. Xcalibur format is converted to netCDF using the OCX provided with Xcalibur, which ensures selection of scans on the basis of scan filters; conversion of netCDF to Xcalibur format is done in-line using the Xcalibur conversion tool, Xconvert.

GC/MS data (nominal mass) from HP/Agilent, Leco, and Thermo Fisher and GC/MS data (accurate mass) from Waters were tested with success. LC/MS data (accurate mass) from Waters (i.e., UPLC-(Q)TOF and HPLC-(Q)TOF, with and without Dynamic Range Extension), Applied Biosystems (HPLC-QTOF), Bruker (UPLC-MicroTOF), and Thermo Fisher (HPLC-Orbitrap at resolution 30000 to 100000) were tested with success. (UPLC = Ultra Performance Liquid Chromatography; HPLC = High Performance Liquid Chromatography; TOF = Time of Flight.)

The maximum theoretical number of data files that can be preprocessed and aligned in one session is 1000. A batch of 940 Leco GC-TOF (nominal mass data) files was successfully aligned, as well as a batch of 550 Waters UPLC-TOF (accurate mass data) files. The time needed to process data lies between minutes and several days depending on the PC configuration, number of files, and mass resolution (see Table 1); each of these two batches took approximately a week. Data reduction is typically between a factor of 100 and 1000 prior to alignment. All modules of metAlign are written in C and Visual C++ version 6.0. A schematic overview is given in Figure 1.

Installation requirements for metAlign are as follows:
(a) Operating system: Windows XP, Windows NT, or Windows 2000.
(b) At least 1 GB of internal memory (SDRAM or better). MetAlign is written in such a way that it almost never uses more than 500 MB. It is recommended to exit any other memory-consuming programs during execution.
(c) Free disk space of 80 GB is recommended to ensure that no disk space problems arise during alignment.
(d) The metAlign program should be run at a screen resolution of 1024 × 768 with small fonts or at a higher resolution with large fonts.
(e) To install and run metAlign you must have administrator rights.

Aligned data, exported as csv files by metAlign and therefore compatible with most multivariate statistical programs, were imported into GeneMaths XT (http://www.applied-maths.com/genemaths/genemaths.htm) for multivariate statistical analysis.

RESULTS AND DISCUSSION

Baseline Correction, Smoothing, and Denoising. Typical noise in a mass trace is governed by a combination of the regular detector noise and chemical noise. Detector noise is more or less constant within a mass trace but can be mass dependent. Chemical noise is due to column bleeding and contaminating components in the chromatography system. Chemical noise is typically concentration dependent and mass dependent and does not appear as a normal peak on a column; it is seen as a continuous noisy contribution to the baseline. Therefore noise cannot be described as a constant, but must be analyzed per mass as a function of gradient/time. Another complicating factor can be the use of thresholds in the storage of data; in this case all values below the threshold are not recorded, to save disk space. Examples of noise and a threshold in mass traces are given in Figure 2.

Nominal Mass Data (See Figure 1 N1-N3). During conversion of the MS data a user-specified binning is performed to obtain nominal data, if nominal data is not already present. In the same process the minimum amplitude in the file is sought and is used as the threshold value for the whole data set. If necessary, the program starts by temporarily filling in the gaps in the mass traces that were created by the use of a threshold. This is done by interpolating between the two points immediately adjacent to a gap.
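The gap-filling step can be illustrated with a minimal Python sketch. This is not the metAlign implementation (which is written in C and Visual C++); the function name `fill_gaps`, the use of `None` for unrecorded points, and the handling of gaps at the very start or end of a trace (copying the nearest recorded value) are assumptions made only for illustration.

```python
def fill_gaps(trace):
    """Return a copy of one mass trace with unrecorded points filled in.

    `trace` is a list of amplitudes per scan; None marks a point that was
    not stored because it fell below the storage threshold (assumed
    representation). Each gap is filled by linear interpolation between
    the two recorded points immediately adjacent to it.
    """
    filled = list(trace)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i                      # first missing index of this gap
        while i < n and filled[i] is None:
            i += 1
        end = i                        # first recorded index after the gap (or n)
        left = filled[start - 1] if start > 0 else None
        right = filled[end] if end < n else None
        if left is None and right is None:
            raise ValueError("trace contains no recorded points")
        # Assumption: a gap touching the edge of the trace is filled flat.
        if left is None:
            left = right
        if right is None:
            right = left
        span = end - start + 1
        for k in range(start, end):
            filled[k] = left + (right - left) * (k - start + 1) / span
    return filled


if __name__ == "__main__":
    print(fill_gaps([10.0, None, None, 40.0, None, 20.0]))
```

Because the filling is described as temporary, a real implementation would presumably also keep track of which points were interpolated; that bookkeeping is omitted here.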
Figure 1. Schematic overview of metAlign. The top half consists of the steps for the data reduction. The bottom half consists of the steps for data alignment. Bold letters indicate additional steps when an iterative alignment is chosen.

As one of the parameters the user must specify the end of the chromatogram via a scan in which the gradient, and therefore chemical noise, is maximal (see Figure 2). For each mass trace the last 1% of the chromatogram, with a minimum of 30 scans, is analyzed by looking for the maximum difference in amplitude between adjacent scans. The maximum difference is taken to be a measure for chemical plus detector noise. Furthermore, the absolute value of the baseline amplitude is stored; this is taken to be a measure for the concentration of the system-related compound creating the chemical noise. Normally the last 1% of the chromatogram should be empty in terms of compounds. Therefore, a check is performed to see if the noise values (maximum difference and absolute amplitudes) obtained are mass-peak related or baseline related; if the former is the case another value is taken from this region.
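As a hedged illustration of the two noise descriptors just described, the sketch below takes the last 1% of a mass trace (at least 30 scans) before the user-specified end scan and returns the maximum adjacent-scan difference together with a baseline amplitude. The function name `tail_noise`, its parameters, and the use of the mean of the tail as the baseline amplitude are assumptions; the check for mass-peak-related values mentioned in the text is not reproduced.

```python
def tail_noise(trace, end_scan, fraction=0.01, min_scans=30):
    """Estimate noise descriptors from the end of one mass trace.

    trace     : list of amplitudes per scan (gaps already filled)
    end_scan  : user-specified scan index marking the end of the chromatogram
    Returns (max_diff, baseline) where max_diff is the largest absolute
    difference between adjacent scans in the tail and baseline is the mean
    tail amplitude (the mean is an assumption made for this sketch).
    """
    n_tail = max(min_scans, int(end_scan * fraction))
    n_tail = min(n_tail, end_scan)          # never read before scan 0
    tail = trace[end_scan - n_tail:end_scan]
    if len(tail) < 2:
        raise ValueError("need at least two scans to estimate noise")
    max_diff = max(abs(b - a) for a, b in zip(tail, tail[1:]))
    baseline = sum(tail) / len(tail)
    return max_diff, baseline


if __name__ == "__main__":
    demo = [0.0] * 100 + [100.0, 102.0, 99.0, 101.0] * 10
    print(tail_noise(demo, end_scan=len(demo)))
```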

Figure 2. GC/MS data examples of baselines of mass traces (time in minutes) corrected by metAlign in automation. (A) Original trace of mass 208: a threshold is observed at the beginning of the chromatogram; chemical noise due to column bleeding is observed toward the end. (B) The trace in A after baseline correction, smoothing, denoising, and peak-picking. (C) Original crowded trace of mass 71. (D) The trace in C after baseline correction, smoothing, denoising, and peak-picking.

The determined descriptors for noise (i.e., maximum difference and absolute baseline amplitudes) are then used to calculate a noise criterion as a function of mass and time. For this, a small window is moved through a mass trace to obtain local minimum amplitude estimates in the time dimension. The local minimum amplitude is assumed to relate linearly to the concentration of chemical noise. Thus, by scaling down the chemical plus detector noise value obtained at the end of the chromatogram, a local noise value is obtained. The local noise value cannot, however, decrease below half the threshold value of the data set. For all mass traces the local noise values are thus estimated as a function of time.

Once the gap-filling and noise estimation are performed, a standard binomial digital filter is executed over the amplitude as well as the derived noise data. The binomial distribution is set by the user-defined average peak width at half-height in scans. This results in a smoothed data set and effectively removes the possibility of noisy mass peaks leading to more than one maximum during peak-picking.

Overloading of samples often occurs in metabolomics. This results in detector saturation. Since the detector is not able to digitize the top of a saturated signal, a lot of noise is often seen at its top. Because of saturation, the peak is also nearly always broad in appearance. This can lead to multiple maxima when peak-picking is done, even though smoothing is applied. Therefore, during the smoothing process a user-defined maximum amplitude threshold is applied; for saturated mass peaks an artificial maximum is constructed at this threshold value.

To discriminate between baseline noise and mass peaks the estimated noise matrix is used. As default (user definition), the local noise value is used to search for possible mass peaks. If two consecutive points in a mass trace increase by more than the local noise value, going from left to right or from right to left in the mass trace, the possibility of a mass peak occurring is noted. The program will add other possible adjacent points participating in the potential mass peak by loosening local peak finding criteria. All points in a mass trace noted as potential signal are placed in “peak regions”, while all points not noted are placed in “baseline regions”. In effect, numerous regions are thus defined within each mass trace.
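The smoothing and peak-region steps above can be sketched as follows. Both functions are illustrative only: how metAlign derives the binomial order from the peak width is not stated, so `order` is simply a parameter here; edge scans are handled by repeating the first and last values; and the flagging rule reads "two consecutive points increase by more than the local noise" as a single step between adjacent points exceeding the local noise. The loosening of criteria to add neighbouring points and the saturation cap are not reproduced. All names are hypothetical.

```python
from math import comb


def binomial_kernel(order):
    """Normalised binomial coefficients C(order, k) / 2**order."""
    return [comb(order, k) / 2 ** order for k in range(order + 1)]


def binomial_smooth(trace, order=4):
    """Smooth one mass trace with a centred binomial filter."""
    if order % 2:
        raise ValueError("order must be even so the kernel is centred")
    kernel = binomial_kernel(order)
    half = order // 2
    n = len(trace)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - half, 0), n - 1)   # clamp at the trace edges
            acc += weight * trace[j]
        smoothed.append(acc)
    return smoothed


def flag_peak_points(trace, noise):
    """Mark points that may belong to a mass peak.

    A pair of adjacent points is flagged when the step between them,
    in either direction, exceeds the local noise value (assumed reading).
    True entries form "peak regions", False entries "baseline regions".
    """
    flags = [False] * len(trace)
    for i in range(len(trace) - 1):
        if abs(trace[i + 1] - trace[i]) > max(noise[i], noise[i + 1]):
            flags[i] = flags[i + 1] = True
    return flags


if __name__ == "__main__":
    raw = [0, 0, 1, 6, 9, 5, 1, 0, 0]
    print(binomial_smooth(raw, order=2))
    print(flag_peak_points(raw, [2] * len(raw)))
```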
The baseline correction algorithm consists of a series of linear baseline corrections utilizing the first and last points of a “peak region”. “Baseline regions” are set to zero. The peak detection algorithm and baseline correction are then performed a second time on the baseline-corrected data using half the previously used noise values. Finally, a threshold is applied to eliminate residual small noise peaks that survived the baseline correction. The threshold is the higher of “x times the local noise value” and a user-defined minimum value. The corrected files are stored as MS data files. A choice can be made for peak-picking (default). Non-peak-picked data can still be used for deconvolution in third-party software but cannot be used for alignment in metAlign (see as example Figure 3C). Peak-picked files are typically 100- to 1000-fold smaller than the original data and are used in subsequent alignment procedures. Examples of the performance are given in Figure 2.

Special Treatment of Leco GC-TOF Data (See Figure 1 N2). Leco GC-TOF data can be exported to netCDF. Leco data (from four different Leco GC-TOF systems) have an inherent peculiarity, which to the author’s knowledge does not occur in other types of data. The amplitude offset (“zero” value) in the mass dimension varies per scan. This is observed in Figure 3A as an irregular baseline oscillation. This means that Leco data need a baseline correction in the mass dimension before a baseline correction in the time dimension. The result of a baseline correction in the mass dimension is shown in Figure 3B; several “ghost” peaks disappear while other previously obscured peaks start to appear in the TIC (Total Ion Current). In Figure 3C a further baseline correction in the time dimension has been performed, rendering the result prior to peak-picking.

Accurate Mass Data (See Figure 1 A1-A4). In most nominal mass MS measurements the actual binary data for mass is already nominal and therefore binned (bin = 1 mass unit) before storage.
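Returning briefly to the time-dimension baseline correction described at the start of this section, a minimal sketch of one correction pass is given below: within each peak region a straight line through the region's first and last points is subtracted, and baseline regions are set to zero. The function name `correct_baseline`, the boolean `flags` input, and the clipping of negative results to zero are assumptions; the second pass at half the noise values and the final threshold are not shown.

```python
def correct_baseline(trace, flags):
    """Apply one pass of linear baseline correction to a mass trace.

    trace : list of (smoothed) amplitudes per scan
    flags : list of booleans, True where a point lies in a "peak region"
    Returns a new list in which baseline regions are zero and each peak
    region has the line through its first and last points subtracted.
    """
    corrected = [0.0] * len(trace)
    n = len(trace)
    i = 0
    while i < n:
        if not flags[i]:
            i += 1                      # baseline region: stays at zero
            continue
        start = i
        while i < n and flags[i]:
            i += 1
        end = i - 1                     # last index of this peak region
        y0, y1 = trace[start], trace[end]
        for k in range(start, end + 1):
            frac = (k - start) / (end - start) if end > start else 0.0
            value = trace[k] - (y0 + (y1 - y0) * frac)
            corrected[k] = max(value, 0.0)   # assumption: no negative amplitudes
    return corrected


if __name__ == "__main__":
    trace = [5, 5, 6, 11, 14, 10, 6, 5, 5]
    flags = [False, False, True, True, True, True, True, False, False]
    print(correct_baseline(trace, flags))
```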
Figure 3. Example of special treatment of Leco GC-TOF-MS data. (A) Part of an original TIC (Total Ion Current) chromatogram after a Leco baseline correction. (B) Same as A after baseline correction in the mass dimension. (C) Same as B after baseline correction in the time dimension.

In accurate mass spectrometry (in centroid mode) the mass over the scans of a mass peak is not constant, but varies, and no binning has been performed. This variation in mass is controlled by the limits of detection, such as noise, saturation, and mass resolution, in combination with the centroiding function used. Using a large bin (compared to mass resolution) for accurate mass data can result in adding more than one mass peak together, as well as adding noise, which may also be present in the bin, to the mass peak. On the other hand, too small a bin could result in incomplete mass peaks. Incomplete mass peaks can also occur if the mass is on the edge of a bin. Therefore, an optimal bin for an accurate mass peak should be defined by the mass resolution, while centering the mass in this bin. To do this for each individual mass occurring in a data set would mean correcting endless numbers of mass (bin) traces. In addition, each mass could be present in numerous traces. This would complicate the processing and require more extensive computing, which is not desirable. The strategy chosen here is a partial compromise at the cost of some more computing time.

Because an accurate mass is dependent on an amplitude range within which the mass is more or less constant, the calculation of the accurate mass must be performed before a baseline correction is done; a baseline correction could, after all, alter the amplitude range. Accurate mass calculation is therefore done during the conversion of the original data. The user defines the mass resolution and the amplitude range in which the data are thought to be appropriate for accurate mass calculation. During conversion of the original data, continuous stretches of (scan, mass, amplitude) points within mass resolution are composed. These stretches are searched for features in which the amplitude must first be going up and then down; these features can be real peaks, but could at this point still also be, for instance, noise or baseline. The average masses of these features of (scan, mass, amplitude) points are then calculated for all points within the given amplitude range. These average mass values of features are stored separately. If no values within a feature are found within the amplitude window, then the mass belonging to the closest amplitude is used.
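The accurate mass calculation for a single feature can be sketched as below. The function name `feature_mass`, the tuple layout, and the use of an unweighted mean are assumptions for illustration; the text specifies only that masses inside the user-defined amplitude range are averaged and that the mass at the closest amplitude is used when no point falls inside the range.

```python
def feature_mass(points, amp_min, amp_max):
    """Return the accurate mass assigned to one feature.

    points  : list of (scan, mass, amplitude) tuples belonging to the feature
    amp_min, amp_max : user-defined amplitude range in which the measured
                       mass is considered reliable
    """
    inside = [mass for _, mass, amp in points if amp_min <= amp <= amp_max]
    if inside:
        # Unweighted mean over the points inside the window (assumption).
        return sum(inside) / len(inside)

    def distance_to_window(amp):
        return amp_min - amp if amp < amp_min else amp - amp_max

    # Fallback: take the mass of the point whose amplitude is closest
    # to the amplitude window.
    _, mass, _ = min(points, key=lambda p: distance_to_window(p[2]))
    return mass


if __name__ == "__main__":
    feature = [
        (1, 300.1010, 50.0),
        (2, 300.1000, 500.0),
        (3, 300.1004, 900.0),
        (4, 300.1020, 5000.0),
    ]
    print(feature_mass(feature, 100.0, 1000.0))
    print(feature_mass(feature, 10000.0, 20000.0))
```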

Since no mass peaks have as yet been detected, smart choices of mass bins cannot be made. Therefore, a first crude approach is taken at this stage to be able to localize mass peaks. Using the original data, mass traces are constructed from bins that each span mass ± mass/(mass resolution). The width of a bin is thus in the order of the width of a mass peak in the mass dimension prior to centroiding. Logically, such bins cannot contain more than one mass value per scan, because mass resolving power is limited to the mass resolution. Consecutive mass bins are chosen to overlap by 50%; these bins are not specifically centered on masses but run continuously over the total mass range. The baseline correction procedure described above for nominal mass is then performed on each mass trace constructed from the bins. The separately stored average mass values are then transferred to the peaks found in the baseline-corrected and denoised mass traces. Mass peaks tend to occur twice because of the overlapping bins; only the highest amplitude is then retained.

Because the bins used above are not centered on mass peaks, they could distort the baseline correction slightly, so the first crude approach is not taken as the final one. Construction of correct mass bins is done as follows. All masses from the data set are projected into the mass dimension to give a mass profile. This mass profile is then analyzed to create a new list of mass bins. Within temporary bins of 1× mass/(mass resolution) a weighting on the basis of amplitudes takes place for the mass peaks present. This weighting determines where the center of a final bin of 2× mass/(mass resolution) will be placed. Bin formation then moves on to the next set of masses within the next 1× mass/(mass resolution). All masses are thus placed at least one time within the central 50% of a bin; the bin is wide enough to account for mass fluctuation within a mass peak.
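For the first crude pass, the running series of overlapping bins can be sketched as follows. This is an illustration under stated assumptions, not the metAlign code: the function name `overlapping_bins` is hypothetical, and the bin center is advanced by half a bin width (mass/(mass resolution)) so that consecutive bins overlap by about 50%. The amplitude-weighted centering used for the final bins is not shown.

```python
def overlapping_bins(mass_min, mass_max, resolution):
    """Generate (low, high) mass bins running over a mass range.

    Each bin spans center ± center/resolution, so its width scales with
    mass. Advancing the center by center/resolution (half a bin width)
    makes consecutive bins overlap by roughly 50%.
    """
    bins = []
    center = mass_min
    while center <= mass_max:
        half_width = center / resolution
        bins.append((center - half_width, center + half_width))
        center += half_width
    return bins


if __name__ == "__main__":
    # Example values only: a narrow slice of a TOF range at resolution 10000.
    for low, high in overlapping_bins(100.0, 100.1, 10000):
        print(f"{low:.4f} - {high:.4f}")
```

Because the bin width is proportional to mass, the number of bins needed to cover a range grows with the mass resolution, which is consistent with the longer running times reported for high-resolution data in Table 1.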
Therefore, the preprocessing of each mass peak should be done correctly at least one time. Less optimal bins will lead to smaller amplitudes. The separately stored average mass values are then again transferred to the peaks found in the final baseline-corrected and denoised mass traces. Again, mass peaks can occur more than once. Only the highest amplitude for each mass peak is retained.

An additional problem to overcome with accurate mass data is shown in Figure 4. To the best knowledge of the author, all centroid accurate mass data from different manufacturers show a large number of significantly elevated noise mass peaks around high-amplitude mass peaks. Although significant, the elevated noise is below 5% of the mass peak amplitude. The origin of these noise mass peaks is not known for certain to the author; they do, however, seem to correlate with the height of the real mass peak and could be related to the centroiding algorithms used. In concentrated samples the total number of mass peaks found can easily increase multifold compared to the real number of masses. Therefore, for accurate mass data, filters as depicted in Figure 4 are applied after the first crude and the second final baseline correction. All “mass peaks” within the filter window are deleted.

Figure 4. For accurate mass data, user-adjustable and mass-peak amplitude related filters, as depicted in red (rectangle and triangle), are applied after baseline correction. All elevated noise peaks within the filter windows are deleted. (A) Example of part of a mass trace originating from an Orbitrap (at resolution 100000). (B) Same as A after deletion of the signals within the filters. (C) Example of part of a mass trace originating from a TOF (at resolution 10000). (D) Same as C after deletion of the signals within the filters.

Nominal Mass Data Alignment (See Figure 1 N4-N6). For alignment, baseline-corrected, denoised, and peak-picked data are used. The software allows for two alignment modes, rough and iterative. In the rough mode, a user-defined time window runs through the time dimension of all data sets to be aligned. Per mass trace and within the time window, mass peaks are grouped on the basis of amplitude; this starts with the largest and ends with the smallest amplitude.

The iterative mode uses the rough mode algorithm. Afterward, however, mass peaks present in all data sets are selected and used as landmarks. For each time point in the middle of the moving time window a difference in retention/scans is calculated on the basis of a minimum number of landmark masses present with a certain amplitude in the whole time window. The average time dimension difference for each data set is calculated with regard to the first data set. The resulting retention/scan difference profiles are then used as a first correction estimate in the next alignment cycle. In the next cycle the moving time window is smaller. Also, the minimum number of landmark masses and their amplitude are lowered. A new time dimension difference profile is calculated for each data set. The iterative cycle continues until the moving time window is in the order of a mass peak width.

In principle the rough alignment mode will always work. The choice for this option would be for empty data sets or for very different data sets (i.e., not many landmarks). Using the rough alignment mode on crowded data sets will, as a consequence, give a number of swapped mass peaks; the number of swapped peaks will depend on the time window size. The iterative alignment mode works well with crowded data sets having a reasonable degree of similarity; this option will, however, fail for empty data sets because of lack of similarity (i.e., lack of landmarks).

Accurate Mass Data Alignment (See Figure 1 A5-A6). All masses of all preprocessed data sets to be aligned are combined into one mass profile. Bins are then generated as described above for a final baseline correction for an accurate mass data set. The nominal mass data algorithms are then used. Nominal mass values are actually substituted by bin numbers, thus tricking the algorithms. After alignment, accurate mass values are transferred back to the peaks in the bins. Since, here again, mass peaks can be present in more than one bin, a selection procedure must be used. The selection procedure is based on the mass distribution within groupings and the completeness of a group. More complete groups are favored, as well as a smaller distribution of accurate mass values within a group. This last process, together with the bin generation, is the alignment in the accurate mass dimension.

Export and Difference Analysis (See Figure 1 Bottom Part). Besides viewing the difference profiles in the time domain and looking up specific masses after alignment, the aligned data can also be exported to an Excel-compatible format. A user can choose his own preferred multivariate statistics program to analyze the Excel data. If a multivariate-analysis derived subselection of mass peaks can be exported to an Excel-compatible format, a metAlign related tool is available to translate this to a supported MS-platform format. An example is given in Figure 5. Alternatively, if only two groups of data sets are to be considered, metAlign provides the possibility of a univariate selection of mass peaks being converted to a difference MS data set. Exporting data back to the original formats has the advantage of expert visual inspection of peaks of interest, while keeping all the possibilities of post-processing provided by MS platforms.

Figure 5. Example of a conversion of a selection of mass loadings into LC/MS data. (A) PCA of four broccoli extract samples after metAlign processing followed by ANOVA (p < 0.01) preselection; yellow, red, and green spots are respectively analytical replicates of broccoli samples 1, 2, and 3; light blue = analytical replicates of a mixture of the three broccoli samples. (B) Selection of mass peak loadings (blue = outer 30%) underlying the separation in the PCA. (C) Mix sample (TIC) preprocessed by metAlign (vertical scale expanded 6 times). (D) Masses from B averaged for sample 1 (yellow) and converted to MS data (TIC). (E) Same as D for sample 2 (red) (TIC). (F) Same as D for sample 3 (green) (TIC).

Benchmarking metAlign. Potential users will want to benchmark metAlign against other software. Two papers19,20 recently appeared in which a comparison of algorithms for finding peaks and aligning was done. Unfortunately, the authors did not use the freely available metAlign software. Furthermore, they only used LC/MS data (no GC/MS) and used a data format that is not compatible with most commercial MS platforms (nor with metAlign). Table 1, however, should give a good indication of performance, together with the figures in this paper.

Although a good start19 with regard to benchmarking peak-finding was made, the following issues should additionally be addressed when comparing peak-finding algorithms. For peak finding the different software packages1-7 might use essentially different approaches, which all may have their own advantages and disadvantages depending on the type of data and the needs of a user. For instance, algorithms depending on nice peak shapes for detection will have problems with peak overlap, varying baselines, peak tailing, peak fronting, thresholds, ion suppression artifacts, and saturation artifacts and therefore miss information, but retain the obvious. A number of the problems just mentioned can also cause problems with fitting algorithms such as the traditional deconvolution of peaks as generally done in GC/MS.20 For deconvolution, overfitting or other erroneous fitting may occur because of irregular peak shapes; also, signal-to-noise will determine how much overlap can be resolved. The metAlign algorithm, which takes the height of a peak after baseline correction, will not take overlap or peak area into account, but will avoid the other problems mentioned. Applying the metAlign algorithm for baseline correction, but without peak-picking, is known to improve subsequent traditional deconvolution.

(19) Tautenhahn, R.; Bötcher, C.; Neumann, S. BMC Bioinf. 2008, 9, 504.
(20) Stein, S. E. J. Am. Soc. Mass Spectrom. 1999, 10, 770–781.

With regard to benchmarking peak-finding, other general issues are as follows: (A) Is the algorithm equipped to deal with varying noise caused by chemical noise? (B) At what signal-to-noise level can the algorithm reliably operate? (C) How well is the software able to filter artifacts (see Figure 4)? (D) Can the algorithm perform a baseline correction in the mass dimension (see Figure 3)? (E) Can the algorithm handle GC/MS and LC/MS at both low and high mass resolution? The algorithm as described here for metAlign can cope with these issues and can run well at a signal-to-noise level of 2 to 3 in the presence of chemical noise.

In the field of alignment there are now also a number of algorithms. As mentioned above, direct comparison with published methods21 is not possible at the moment. However, generally speaking, additional issues for benchmarking alignment should also be addressed: (A) How dissimilar may data sets be in composition and still succeed with alignment? (B) How many peak signals can be aligned at low and high mass resolution (what is the maximum mass resolution)? (C) How many data files can be aligned in one alignment? (D) How does alignment react to saturation of signals? (E) What kind of retention shifts can be handled? (F) How well is mass binning performed? As noted in the text, metAlign has two alignment options (rough and iterative), which can handle up to 1000 highly dissimilar as well as highly crowded similar data sets (limit >1000000 peaks) at both low and high resolution (mass resolution ≤ 100000). MetAlign can handle peak broadening due to saturation and can handle retention shifts related to temperature, as well as slight solvent and pH differences.
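To make the rough alignment option referred to above more concrete, the sketch below groups the peaks of a single mass trace across data sets, working from the largest to the smallest amplitude and accepting at most one peak per data set within a scan window around the seed peak. This is only one plausible reading of the description in the Nominal Mass Data Alignment section; the function name `rough_align`, the tuple layout, and the rule for choosing among candidates are assumptions.

```python
def rough_align(peaks, window):
    """Group peaks of ONE mass trace across data sets (rough mode sketch).

    peaks  : list of (dataset_id, scan, amplitude) tuples
    window : maximum allowed scan difference to the seed peak of a group
    Returns a list of groups; each group maps dataset_id -> peak tuple.
    """
    # Process peaks from the largest to the smallest amplitude.
    order = sorted(range(len(peaks)), key=lambda i: -peaks[i][2])
    used = set()
    groups = []
    for seed in order:
        if seed in used:
            continue
        seed_dataset, seed_scan, _ = peaks[seed]
        group = {seed_dataset: peaks[seed]}
        used.add(seed)
        for cand in order:                     # candidates, largest first
            dataset, scan, _ = peaks[cand]
            if cand in used or dataset in group:
                continue                       # one peak per data set per group
            if abs(scan - seed_scan) <= window:
                group[dataset] = peaks[cand]
                used.add(cand)
        groups.append(group)
    return groups


if __name__ == "__main__":
    demo = [
        ("A", 100, 900.0), ("B", 103, 850.0), ("C", 98, 40.0),
        ("A", 140, 300.0), ("B", 142, 310.0),
    ]
    for g in rough_align(demo, window=5):
        print(sorted(g.items()))
```

With a wide window and crowded traces, a greedy grouping of this kind can assign a neighbouring peak to the wrong group, which is in line with the swapped mass peaks mentioned in the text for the rough mode.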
MetAlign has shown its value in alignment in several high-ranking publications.10-18 For accurate mass data, static mass binning is often used; the bins are then constant over the whole mass range. However, a static bin is essentially incorrect, since the mass resolution dictates that mass precision is mass dependent. For accurate mass data metAlign therefore uses dynamic binning, in which the bins are related to the mass as dictated by the user-defined mass resolution.

The essence of software like metAlign is to handle a large amount of data and to streamline the essentials into a presentable format for further analysis. Maximal interaction with other software that enables such further analysis is therefore a prerequisite. Besides fast statistical analysis after alignment of all signals for general quality control (see Figure 5A) and untargeted analysis,16 statistics can also be used for statistics-driven identification. This has been implemented with success for GC/MS,7,12 where statistical clustering is applied to aligned data to reconstruct partial spectra of unique non-overlapping peaks. The highly reliable partial spectra, together with the retention information, lead to reliable identification.7,12 Additionally, knowing the unique non-overlapping peaks offers good opportunities to restrict errors in conventional or other deconvolution performed by other software packages. Similarly, in LC/MS, in-source fragmentation products and adducts can often be clustered to the parent ion18 to assist in identification.

In LC/MS the accuracy of the mass (together with MS/MS analysis) is often crucial in the identification of unknowns, as well as in the screening of known compounds. Therefore, the way the accurate mass is calculated is important. Each MS machine has its own amplitude window in which the mass is accurate. MetAlign is at present the only software that takes this into account in automation.

(21) Lange, E.; Tautenhahn, R.; Neumann, S.; Gröpl, C. BMC Bioinf. 2008, 9, 375–394.

Other general benchmarking questions are as follows: (A) What are the data size limits? (B) How bug-free is the software? (C) How friendly is the software with regard to direct use, documentation, and visualization? (D) What data has the software been tested on, and which format types can be handled? (E) How does the software interact with other software packages? An answer to most of these questions for metAlign has indirectly been given in the text.

Looking back at the many developments made in processing metabolomics data, it is clear that the informatics side is continuously evolving with the requirements of users in the field and with new separation and MS technology. Although adequate software will continue to play a major role in the metabolomics data stream, it is equally clear that the quality of the data will continue to play an essential role in future large-scale metabolomics.

CONCLUSION

MetAlign has proven to be a powerful tool for data preprocessing of GC/MS as well as LC/MS based experiments. The preprocessing consists of automatic format conversions, accurate mass calculations, baseline corrections, peak-picking, saturation and mass-peak artifact filtering, as well as alignment of up to 1000 data sets. Because of the 100- to 1000-fold data reduction, future databasing of complete GC/MS and LC/MS profiles is possible. Identification software based on the reduced metAlign output is under investigation. The metAlign software is easily installed on a standard computer running Windows and is meant as a bridge between different commercial and freeware software platforms. The software, manual, and additional tips can be downloaded free of charge at www.metalign.nl.
Regular updates will be provided in the future through this Web site. Source code will be made available under a Material Transfer Agreement on an individual basis. MetAlign has been integrated in Tagfinder.7

ACKNOWLEDGMENT

This work was supported by the Dutch Ministry of Agriculture, Nature and Food Quality, Strategic Research Funds RIKILT-WUR (project 77232903), Statutory Research Tasks (theme 3): veterinary drugs (project 87203001), The Netherlands Toxicogenomics Centre (NTC), contract AIR3-CT94-2311 (European Commission, DG XII), and the EU Framework VI programme: EU-METAPHOR (FP6: FOOD-CT-2006-036220), EU-NOFORISK (FP6: FOODCT2001-506387), EU-GMOCARE (QLK1-1999-00765). Data sets used in the development of metAlign are from these projects and research consortia. Ric de Vos and Yury Tikunov of Plant Research International (Centre of Biosystems Genomics) are thanked for critical evaluation using their own data in the validation process of metAlign. Furthermore, Emma Marsden-Edwards (Waters, Manchester) and Christophe Junot (CEA, Paris) are thanked for the use of their UPLC-TOF and HPLC-Orbitrap data, respectively.

Received for review January 7, 2009. Accepted February 28, 2009.
AC900036D