
TracMass 2A Modular Suite of Tools for Processing Chromatography-Full Scan Mass Spectrometry Data Erik Tengstrand,† Johan Lindberg,‡,§ and K. Magnus Åberg*,† †

†Stockholm University, Department of Analytical Chemistry, SE-106 91 Stockholm, Sweden
‡Analytical Proof Sweden AB, 115 37 Stockholm, Sweden
§Global Safety Assessment, AstraZeneca R&D, 151 85 Södertälje, Sweden

Received: December 2, 2013. Accepted: March 10, 2014. Published: March 10, 2014.

ABSTRACT: In untargeted proteomics and metabolomics, raw data obtained with an LC/MS instrument are processed into a format that can be used for statistical analysis. Full scan MS data from chromatographic separation of biological samples are complex, and analyte concentrations need to be extracted and aligned so that they can be compared across the samples. Several computer programs and methods have been developed for this purpose. There is still a need to improve the ease of use and the feedback to the user because of the advanced multiparametric algorithms used. Here, we present and make publicly available TracMass 2, a suite of computer programs that gives immediate graphical feedback to the data analyst on parameter settings and processing results, as well as producing state-of-the-art results. The main advantage of TracMass 2 is that the feedback and transparency of the processing steps generate confidence in the end result, which is a table of peak intensities. The data analyst can easily validate every step of the processing pipeline. Because the user receives feedback on how all parameter values affect the result before starting a lengthy computation, the user's learning curve is enhanced and the total time used for data processing can be reduced. TracMass 2 has been released as open source and is included in the Supporting Information. We anticipate that TracMass 2 will set a new standard for how chemometrical algorithms are implemented in computer programs.

Here, we consider the problem of processing data from nontargeted liquid or gas chromatography coupled to high resolution mass spectrometry (LC/MS and GC/MS) into tables that can be used for statistical analysis. In nontargeted analysis, the goal is to acquire signals from as many compounds as possible. For biological samples, for example those used in metabolomics or proteomics, the acquired data can be large and complex. Therefore, manual analysis of all signals is often not feasible, and a number of software tools have been designed to automate the data processing and enable statistical analysis. Examples are MetAlign,1 XCMS,2 and MZmine.3 In a review of data processing for mass spectrometry-based metabolomics, Katajamaa and Orešič4 have presented a more comprehensive list of 14 free and 12 commercial software tools.

The fundamental signals of interest here are chromatographic peaks at a certain retention time and m/z ratio, which we will refer to as peaks. The aim of the data processing is to construct a table with peak properties for all peaks in all samples. Peak properties can, for instance, include retention time, intensity, area, and width. In a particular sample, a compound would normally give rise to one or more peaks occurring at the same retention time, provided it can be ionized and thereby detected in the mass spectrometer. In another similar sample, we would expect to see largely the same set of peaks, especially those with high intensity. Low intensity peaks originating from the compound may appear and disappear because of varying concentration or adduct formation. Peaks may have slightly different retention times because of chromatographic instability and varying sample properties. The m/z ratios may also differ to some extent because of instrument drift and statistical uncertainty.

The data processing procedure should identify and characterize peaks in all samples and align the peaks so that their properties can be tabulated. Peaks with a certain m/z ratio corresponding to the same compound should be represented by one variable in the data table. Ideally, any variable should be free from information derived from other compounds with nearby retention times and m/z ratios. A common strategy for the data processing is to start with a preprocessing step, followed by peak detection and then alignment to correct for retention time shifts.4 Preprocessing may include baseline correction, noise reduction, and extraction of ion chromatograms.


Alternative strategies include binning the data and solving the retention time shifts by fuzzy correlation,5 or applying two-way curve resolution to GC/MS data.6 In the preprocessing step, it is common to divide the data into mass channels to avoid having to search for peaks in the m/z dimension, for example, by generating extracted ion chromatograms, regions of interest (ROIs),7 or pure ion chromatograms.8 An extracted ion chromatogram is obtained by dividing the m/z axis at regular intervals and summing all intensities in each interval.4 The principle behind the ROI algorithm is that the m/z values deviate less where there are peaks, and therefore the algorithm searches for such regions.7,9 Pure ion chromatograms (PICs) are chromatograms created by connecting data points with similar masses.

There are numerous strategies for peak detection;10 two examples are the use of wavelets7 and matched filtering.11 Many methods have also been developed for the alignment step12 that can loosely be categorized into two classes: warping and matching of peak lists. However, most approaches use a combination of both alignment classes. Poor alignment may reduce the power of subsequent statistical analyses, and hence additional manual investigation may be required if the results are unclear.

Current software tools for data processing (and sometimes also statistical analysis) generally do not give detailed feedback on the behavior of the algorithms. The lack of feedback can make it difficult to establish how to use the software optimally. It may not be obvious how to adjust the parameters when, for example, scan frequency, mass resolution, chromatographic peak widths, and baselines can all change.

In this paper, we present TracMass 2, a state-of-the-art, open-source software tool for detecting and aligning peaks in chromatography-mass spectrometry data sets with centroided full-scan mass spectra. The main feature distinguishing TracMass 2 from existing software tools is quick visual feedback at every processing step. The visual feedback was designed to help the user set parameters and understand the process without needing to go into the program code. In addition, we introduce new algorithms for chromatogram extraction and alignment of chromatography-mass spectrometry data and assess these on several publicly available LC/MS data sets. An example of a GC/MS data set is found in the Supporting Information, Figure S-1 and Table S-1. TracMass 2 runs under MATLAB R2010b (MathWorks, Natick, US) or later and is included in the Supporting Information. We hereby release TracMass 2 with source code under the GNU General Public License version 3 or any later version.

Scheme 1. (Left) TracMass 2 Workflow and (Right) Workflow for Setting the Parameters in Each Step

METHODS
The procedure for data processing in TracMass 2 consists of four steps: raw data inspection, creation of pure ion chromatograms, peak detection, and alignment, where each step provides graphical feedback to the user. In this section, we give a brief overview of the algorithms and their parameters. The workflow is summarized in Scheme 1.

Creation of Pure Ion Chromatograms. In the original version of TracMass, a Kalman filter was used to create pure ion chromatograms (PICs).8 In TracMass 2, the original algorithm has been replaced by a tracking algorithm without the overhead imposed by the Kalman filter. The new algorithm is much faster, while giving identical results. The new algorithm starts by assigning a PIC ID to every data point in the first scan. In the next scan, all data points are assigned a PIC ID depending on their m/z values by a greedy nearest neighbor strategy. If the m/z value is sufficiently close to that of a PIC in the previous scan, the data point is assigned the same PIC ID. If it is close to two different PICs, it is assigned the ID of the one that is closest. If two data points have the same PIC as their nearest neighbor in the previous scan, the one with the smallest m/z difference is assigned first, and the other one is assigned as if the already assigned PICs did not exist. If there are no PICs with similar m/z values in the previous scan, the data point is given a new PIC ID. The m/z values of the PICs are updated by an exponentially weighted moving average after the assignment. Once all scans are processed, PICs that are too short or have too low intensity are removed. The key parameters that can be adjusted are as follows: minimum intensity and length, which remove weak and short PICs, respectively; and mass tolerance, mass anchor, and mass transformation, which determine the uncertainty in mass and its variation as a function of the mass (time-of-flight mass analyzers have an uncertainty that increases as the square root of the mass, whereas quadrupoles have a mass-independent uncertainty).
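As an illustration of the tracking step, the MATLAB sketch below assigns PIC IDs scan by scan with a greedy nearest-neighbor match inside a fixed mass tolerance and updates each PIC's m/z by an exponentially weighted moving average. It is a simplified reimplementation, not the TracMass 2 code: points are handled in m/z order rather than strictly by smallest mass difference, candidates are matched against all open PICs rather than only those present in the previous scan, and the function signature and parameter names are illustrative.

```matlab
% Sketch of greedy nearest-neighbor PIC tracking (illustrative, not the TracMass 2 code).
% scans:  cell array; scans{k} is an [n x 2] matrix of centroids [mz, intensity] for scan k.
% tol:    absolute m/z tolerance (constant here for simplicity).
% lambda: weight of the exponentially weighted moving average updating each PIC's m/z.
% minLen, minInt: PICs shorter than minLen scans or weaker than minInt are removed at the end.
function pics = trackPICs(scans, tol, lambda, minLen, minInt)
    pics = struct('mz', {}, 'scan', {}, 'intensity', {});   % one element per PIC
    trackMz = [];                                            % running m/z estimate of each PIC
    for k = 1:numel(scans)
        pts = sortrows(scans{k}, 1);                         % simplification: handle points in m/z order
        assigned = false(size(trackMz));                     % a PIC can claim at most one point per scan
        for n = 1:size(pts, 1)
            mz = pts(n, 1);  inten = pts(n, 2);
            d = abs(trackMz - mz);
            d(assigned) = Inf;                               % PICs already assigned in this scan are skipped
            [dmin, j] = min([d, Inf]);                       % trailing Inf guards the empty case
            if dmin <= tol                                   % close enough: extend the nearest free PIC
                trackMz(j) = (1 - lambda)*trackMz(j) + lambda*mz;  % EWMA update of the PIC m/z
            else                                             % otherwise start a new PIC
                j = numel(trackMz) + 1;
                trackMz(j) = mz;
            end
            assigned(j) = true;
            pics(j).mz(end+1) = mz;
            pics(j).scan(end+1) = k;
            pics(j).intensity(end+1) = inten;
        end
    end
    keep = arrayfun(@(p) numel(p.scan) >= minLen && max(p.intensity) >= minInt, pics);
    pics = pics(keep);                                       % remove PICs that are too short or too weak
end
```

For a time-of-flight instrument, the constant tolerance tol would be replaced by one that grows as the square root of m/z, anchored at the mass anchor parameter described above.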


Table 1. Specifications for TracMass 2

property                        value
data input                      centroided full-scan chromatography-MS
file formats                    NetCDF, mzData, mzXML, mzML
output file formats             comma separated value file
programming language            MATLAB (MathWorks, Natick, MA)
runtime environment             MATLAB R2010b or higher (R2013b tested on OS X and Windows 7)
required additional toolboxes   GUI Layout toolbox version 1 p14 (http://www.mathworks.com/matlabcentral/fileexchange/27758-gui-layout-toolbox), BSD license; Snctools version r4040 (http://mexcdf.sourceforge.net/), MIT license; SplashScreen version 1 p1 (http://www.mathworks.com/matlabcentral/fileexchange/30508-splashscreen), BSD license; all included with the complete program in the Supporting Information
license                         GPL version 3 or later
source code availability        Supporting Information and authors upon request

Peak Detection. The peak detection is based on convolution with zero area filters.11 A zero area filter is the negative of the second derivative of a Gaussian model peak. Each PIC is convolved with two zero area filters of different widths. The maximum of the two filtered PICs is taken as an indicator chromatogram, where the positions of maxima correspond to peak locations in the original PIC. To create a decision limit for the peak intensity, a local estimate of the noise is used. The noise level is estimated for each data point as the weighted standard deviation of the difference between the PIC and its smoothed version. The smoothing is performed by convolution with a Gaussian filter, and the weighting for the standard deviation is also Gaussian, with a different width. To prevent too narrow a filter from adapting to the noise, a correction factor is used. The correction factor is computed so that the local estimate of the noise is unbiased for a vector of normally distributed random values with a standard deviation of one. Peak intensities are recorded as the intensity of the smoothed signal at the position of the peak. The parameters for the peak detection step are the two zero area filter widths, the standard deviation of the smoothing filter, the signal-to-noise value required to accept a peak, and the width of the local estimate of the noise.
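A minimal sketch of this detection scheme on a single PIC is given below, assuming evenly spaced scans: the zero-area filter is built as the negative second derivative of a Gaussian, the two filtered traces are combined by a pointwise maximum, and the noise is estimated as a Gaussian-weighted standard deviation of the residual against a Gaussian smooth. The function and parameter names are illustrative, the decision rule (smoothed intensity above a signal-to-noise multiple of the local noise) is a plausible reading of the description rather than the exact TracMass 2 criterion, and the unbiasing correction factor is omitted for brevity.

```matlab
% Sketch of zero-area-filter peak detection on one PIC (illustrative, not the TracMass 2 code).
% y: intensity vector of a PIC; w1, w2: widths (in scans) of the two zero-area filters;
% wSmooth: std of the Gaussian smoothing filter; wNoise: width of the Gaussian noise weighting;
% snr: signal-to-noise value required to accept a peak. Returns indices of accepted peaks.
function peaks = detectPeaks(y, w1, w2, wSmooth, wNoise, snr)
    y = y(:)';
    % Indicator chromatogram: pointwise maximum of the two zero-area-filtered traces.
    ind = max(conv(y, zeroAreaFilter(w1), 'same'), conv(y, zeroAreaFilter(w2), 'same'));
    ySmooth = conv(y, gaussKernel(wSmooth), 'same');              % Gaussian smooth of the PIC
    resid   = y - ySmooth;                                        % difference between PIC and its smooth
    noise   = sqrt(conv(resid.^2, gaussKernel(wNoise), 'same'));  % locally weighted noise estimate
    isMax = [false, ind(2:end-1) > ind(1:end-2) & ind(2:end-1) >= ind(3:end), false];
    peaks = find(isMax & ySmooth > snr*noise);                    % local maxima above the decision limit
end

function k = zeroAreaFilter(w)
    t = -ceil(4*w):ceil(4*w);
    k = -((t.^2 - w^2)/w^4) .* exp(-t.^2/(2*w^2));                % negative second derivative of a Gaussian
    k = k - mean(k);                                              % enforce zero area on the discrete grid
end

function g = gaussKernel(w)
    t = -ceil(4*w):ceil(4*w);
    g = exp(-t.^2/(2*w^2));
    g = g/sum(g);                                                 % normalized Gaussian weights
end
```

Peak intensities would then be read from ySmooth at the returned positions.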

Alignment. The peaks are aligned so that peaks with the same m/z value from a given compound are assigned to a unique variable. The alignment procedure is divided into four steps: clustering, followed by warping and a second clustering, and finally, resolution of ambiguous clusters with the generalized fuzzy Hough transform (GFHT).13,14

The clustering steps start by scaling the mass and time axes by their expected uncertainties. Next, the peaks from all samples are connected by Delaunay triangulation, and connections longer than one after the scaling are removed. Peaks that are still connected are considered to belong to the same cluster, which is identified using a breadth-first search algorithm. All clusters containing exactly one peak from each sample are used as landmark peaks for the warping. The parameters for the clustering are the uncertainties in mass and time.

In the warping step, P-splines15 are fitted to the retention time deviations from the means of the landmark clusters. The retention times for all peaks in a sample are adjusted by the retention time shifts from the warping function. The warping has only one parameter, which is the number of P-splines used. The smoothing parameter of the P-splines is set automatically by optimizing the generalized cross validation criterion (see Eilers15).

The peaks are clustered a second time based on their warped retention times. The results from the clustering can be exported to a text file. For samples that have two or more peaks grouped together in a cluster, an ambiguous alignment result, the highest-intensity peak is reported.
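The first clustering step can be sketched as follows: scale the peak coordinates by the expected uncertainties, build a Delaunay triangulation, drop edges longer than one, and take connected components as clusters. This is an illustrative reimplementation, not the TracMass 2 code; the connected components are found with a simple breadth-first search over the pruned edge list, and the input format is assumed.

```matlab
% Sketch of the first clustering step (illustrative, not the TracMass 2 code).
% mz, rt: vectors with the m/z values and retention times of all peaks (all samples pooled);
% mzUnc, rtUnc: expected uncertainties used to scale the two axes.
function clusterId = clusterPeaks(mz, rt, mzUnc, rtUnc)
    X = [mz(:)/mzUnc, rt(:)/rtUnc];                  % scale both axes by their expected uncertainty
    tri = delaunay(X(:,1), X(:,2));                  % Delaunay triangulation of the scaled peaks
    e = [tri(:,[1 2]); tri(:,[2 3]); tri(:,[3 1])];  % edge list of the triangulation
    len = sqrt(sum((X(e(:,1),:) - X(e(:,2),:)).^2, 2));
    e = e(len <= 1, :);                              % remove connections longer than one
    n = size(X, 1);
    nbr = cell(n, 1);                                % adjacency lists of the pruned graph
    for i = 1:size(e, 1)
        nbr{e(i,1)}(end+1) = e(i,2);
        nbr{e(i,2)}(end+1) = e(i,1);
    end
    clusterId = zeros(n, 1);
    c = 0;
    for s = 1:n
        if clusterId(s) == 0                         % unvisited peak: start a new cluster
            c = c + 1;
            clusterId(s) = c;
            queue = s;
            while ~isempty(queue)                    % breadth-first search over the pruned edges
                v = queue(1);  queue(1) = [];
                for u = nbr{v}
                    if clusterId(u) == 0
                        clusterId(u) = c;
                        queue(end+1) = u;            %#ok<AGROW>
                    end
                end
            end
        end
    end
end
```

Clusters in which every sample contributes exactly one peak would then serve as the landmark peaks for the warping.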

The fourth step, resolution of ambiguous clusters using the generalized fuzzy Hough transform (GFHT), is optional. The GFHT is performed independently for each cluster, akin to its application to NMR data.13,14,16 A shift model is constructed from the retention times of up to twenty of the nearest landmark peaks using principal component analysis (PCA). A linear combination of M components from the PCA is used to model the retention time of the peaks in the cluster that is to be resolved: $s_i = \alpha_0 + \sum_{j=1}^{M} m_{ij}\alpha_j$, where $s_i$ is the retention time of the peak in sample i, $\alpha_0$ is a constant that shifts the entire pattern, $m_{ij}$ is the shift value of the jth pattern for sample i, and $\alpha_j$ is the coefficient of the jth pattern. The aim is to find a model (i.e., a set of α-values) that fits well to a subset of peaks in the cluster. The subset represents peaks from the same analyte and defines a new unambiguous cluster. The model is given by

$$
\operatorname*{argmin}_{\{\alpha_j\}_{j=0}^{M}} \left( -\sum_{i=1}^{N} e^{-(s_i - z_i)^2/\sigma^2} + \frac{p}{N} \sum_{j=1}^{M} e^{\alpha_j^2} \right)
$$

The first term is the negative Hough score, and the second term is a penalty; $z_i$ is the true position of the peak closest to the predicted position $s_i$ in sample i, $\sigma^2$ is the GFHT fuzzy parameter specifying the amount a peak is allowed to deviate from the predicted retention time, p is the penalty parameter, and N is the number of samples. The new cluster comprises the subset of peaks that are sufficiently close to the model, or closest in the case of ambiguity. The criterion of sufficiently close is defined by a parameter called the time tolerance. To avoid local optima, several starting guesses are made for the coefficients of the model. The best guess is optimized with the Nelder-Mead simplex algorithm. To prevent overfitting, high values of the coefficients are penalized exponentially. The generalized fuzzy Hough transform has three parameters: the number of components (M), the time tolerance, and the penalty (p); σ is equal to the time tolerance.

A core feature of TracMass 2 is that the user can obtain feedback at every step in order to improve and validate the parameter settings. For the more time-consuming steps, it is possible to process a small part of the data set to increase the speed of parameter setting (Scheme 1).
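As an illustration of the GFHT objective defined above, the MATLAB fragment below evaluates the negative Hough score plus the exponential penalty for one candidate coefficient vector and hands it to fminsearch, MATLAB's Nelder-Mead simplex optimizer. The variable names, the data layout, and the omission of the multiple starting guesses and of the final assignment of peaks to the new cluster are all simplifications; this is a sketch of the objective, not the TracMass 2 implementation.

```matlab
% Sketch of the GFHT objective for one ambiguous cluster (illustrative, not the TracMass 2 code).
% alpha:  column vector [alpha_0; alpha_1; ...; alpha_M] of model coefficients
% m:      N x M matrix of shift patterns (PCA scores from nearby landmark retention times)
% peakRt: cell array; peakRt{i} holds the observed peak retention times in sample i
% sigma:  fuzzy parameter (equal to the time tolerance); p: penalty parameter
function f = gfhtObjective(alpha, m, peakRt, sigma, p)
    N = size(m, 1);
    s = alpha(1) + m*alpha(2:end);                    % predicted retention times s_i
    score = 0;
    for i = 1:N
        if ~isempty(peakRt{i})
            [~, k] = min(abs(peakRt{i} - s(i)));      % observed peak closest to the prediction
            z = peakRt{i}(k);
            score = score + exp(-(s(i) - z)^2/sigma^2);   % fuzzy Hough score contribution
        end
    end
    f = -score + (p/N)*sum(exp(alpha(2:end).^2));     % negative score plus exponential penalty
end

% One starting guess alpha0 would be refined with the Nelder-Mead simplex, e.g.:
% alphaHat = fminsearch(@(a) gfhtObjective(a, m, peakRt, sigma, p), alpha0);
```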



RESULTS AND DISCUSSION
TracMass 2 is primarily intended as a suite of software tools for processing nontargeted chromatography-MS data acquired in centroided full-scan mode. The software suite can be run as a single program where each processing step is reached through a tab-panel interface. The individual processing steps can also be run as separate programs. We have implemented the methods described above in MATLAB code with a focus on providing graphical feedback about algorithm behavior and parameter settings. The algorithms build upon existing algorithms and strategies but are partly or entirely new. Therefore, we used a number of publicly available data sets to benchmark TracMass 2 against other similar software tools as well as against published results. Specifications and requirements for TracMass 2 are summarized in Table 1. In the following sections, we explain the graphical feedback and the method for finding appropriate parameter settings, and compare performance based on theoretical considerations and benchmark data sets. Performance comparisons were made for the individual processing steps: chromatogram extraction, peak detection, and alignment.


Figure 1. (Left) Parameters that can be changed. A change in the parameters immediately changes the results of the peak detection in the chromatogram (right). Inspection of a few pure ion chromatograms allows the user to fine-tune the parameters as well as estimate the efficiency of the process. Peaks are detected where the green or cyan lines lie above the red dashed line.

The parameters used can be found in the Supporting Information, Tables S-2 and S-3.

Raw Data Inspection. The facility to inspect the raw data was included to provide a graphical interface for discarding outliers. Fatal errors, such as using the wrong sample or a run canceled halfway through, are often easily discernible from inspecting the total ion or base peak chromatograms. If outliers are not discarded, they can pose a problem for the alignment step. In this part of the graphical user interface (GUI), one can switch between viewing total ion chromatograms and base peak chromatograms (see Supporting Information Figure S-2). We decided to keep the data inspection process simple, that is, not to reimplement the powerful data viewers provided in most instrument vendor software. Instead, the samples are collected in a sample list and an outlier list to provide a clear overview and easy access for comparisons. One view displays chromatograms for all samples, while a second view allows users to choose which samples to display for detailed comparisons and validation of outliers.

Chromatogram Extraction. Extracting chromatograms from centroided MS data is a convenient way of converting the data into a format that is efficient for peak detection algorithms. TracMass 2 uses a tracking algorithm to find so-called PICs, which rely on the fact that only one ion contributes to the chromatogram, in contrast to extracted ion chromatograms, where ions within an m/z interval are used to create a chromatogram by summing, averaging, or taking the most intense signal from each scan. Extracting PICs filters noise from the data; in the data sets analyzed here, the PICs typically comprise 10% of the original data. The amount of data covered by the PICs will vary depending on the samples, the instrument and its acquisition parameters, and the data analysis parameters.

The PICs contain peaks, baseline, and artifacts originating from column bleed or ions from the mobile phase. The discarded 90% mainly consists of signals that do not appear with consistent mass in sufficiently many subsequent scans, that is, signals inconsistent with the presence of a peak. The XCMS program extracts ROIs, which gives similar results, albeit with a completely different algorithm.7 The difference in the results between the two algorithms mainly stems from how mass uncertainty is treated: XCMS uses a constant relative error (default 25 ppm), whereas in TracMass 2, the uncertainty is constant or increases as (m/z)^(1/2). Thus, TracMass 2 finds more mass traces in the low mass region. The algorithm output is a list of mass traces (chromatograms) with different lengths that are subjected to peak detection. A PIC may contain noise and one or more peaks, partly depending on the parameter values.

The graphical feedback is split into two views (see Supporting Information Figure S-3). One shows an m/z vs retention time overview where the raw data and the PICs are shown. The other view is a three-dimensional plot of a single PIC, showing retention time, m/z, and intensity as well as the raw data in the vicinity of the PIC. The m/z tolerance parameter is displayed in the second view so that the user can compare its value with the mass uncertainty of the acquired data. The PIC displayed in the second view is chosen interactively by clicking on it in the overview or accessed by its number. The PICs are numbered in order of descending intensity.

The Tracker tool allows a subset of a single sample (the subset is chosen by zooming in the overview plot), a complete single sample, or a set of samples to be processed. We recommend starting by processing just part of a sample, then reviewing the results and tuning the m/z tolerance parameter.
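To make the two mass-uncertainty models concrete, the lines below compare a constant relative tolerance of 25 ppm with a square-root mass dependence anchored to the same width at m/z 500; the anchor value and plotted range are arbitrary illustrative choices, not TracMass 2 defaults.

```matlab
% Illustrative comparison of the two mass-uncertainty models (values are example choices only).
mz = (50:1000)';                              % m/z range to compare
tolPpm  = 25e-6*mz;                           % constant relative error: 25 ppm of m/z
anchor  = 500;                                % hypothetical mass anchor
tolSqrt = 25e-6*anchor*sqrt(mz/anchor);       % tolerance growing as sqrt(m/z), equal at the anchor
plot(mz, [tolPpm, tolSqrt]);
xlabel('m/z'); ylabel('tolerance (Da)');
legend('constant 25 ppm', 'sqrt(m/z) scaling', 'Location', 'northwest');
```

Below the anchor, the square-root model keeps a wider absolute tolerance than the constant 25 ppm line, which is consistent with more low-mass traces surviving the extraction.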


Thereafter, an entire sample can be processed to get an idea of the minimum intensity and length for retaining a PIC. Finally, when the results for the single sample have been checked, all samples of the data set can be processed. The processing time of a single sample may range from seconds up to minutes depending on the amount of data. By employing this three-step processing strategy for both tracking and peak detection, parameter tuning in TracMass 2 is both straightforward and fast.

The complexity of the tracking algorithm is O(N log N), where N is the number of data points in a sample. The ROI algorithm has identical complexity. The program MZmine 2.0 has a chromatogram generation step with a complexity of O(N²). When benchmarking the chromatogram extraction of TracMass 2, XCMS, and MZmine 2 on a 400 MB sample, TracMass 2 and XCMS both finished in less than a minute, whereas MZmine 2.0 did not finish before the extraction process was canceled after a few hours. This result demonstrates the importance of low algorithmic complexity.

Peak Detection. Fundamental questions related to peak detection are: What was detected as peaks? What was missed? And how should the parameters be set to achieve the desired results? These questions are best answered by interactive graphics. The first question can be addressed by displaying chromatogram snippets with the peaks clearly marked, which can be achieved with most current software tools. TracMass 2 uniquely provides graphical feedback to help with the latter two questions (Figure 1). The second question is the most difficult because it requires that peaks be found in regions where the peak detection algorithm failed (if indeed it did). If the ratio of chromatogram snippets without peaks to those with peaks is high, the task can be complex. Allowing the data analyst to browse chromatograms by intensity or by m/z and retention time can help answer the question qualitatively in a reasonably short time.

To set the parameters correctly, it is desirable to get immediate feedback for a few chromatograms of different kinds. The feedback should help the data analyst answer the first two questions for the chromatograms that are reviewed. This is the design philosophy behind TracMass 2. Other software tools force the user to select parameters and process a sample, or even an entire data set, before there is a chance to review the results. The presentation of chromatogram snippets with detected peaks makes it hard to draw conclusions about which parameters to change and how. Therefore, we have tried to make the peak detection algorithm in TracMass 2 transparent and to present detailed results immediately upon parameter changes. This enables the user to rapidly see the effects of changing the parameters and removes the need to test large numbers (fifty to hundreds) of parameter settings on a sample comprising a mixture of known compounds to optimize the parameters, as required in Tautenhahn et al.7

The peak detection algorithm was compared to the centWave and matchedFilter algorithms of XCMS. The centWave algorithm uses the continuous wavelet transform to enable detection of both wide and narrow peaks within an interval of allowed widths, whereas matchedFilter uses a single zero-area filter. Our algorithm has two filters and therefore is intermediate between centWave and matchedFilter. The three algorithms were benchmarked using the data set accompanying the centWave publication7 by repeating the processing procedure with the published parameters. The peak numbers for XCMS were similar to those in the original publication but not exactly the same, possibly reflecting the fact that a later version of XCMS was used. The results are summarized in Venn diagrams (Figure 2), where it can be seen that about two hundred peaks were unique to TracMass 2 and thirty unique to XCMS.

Figure 2. Venn diagrams summarizing the results obtained with the peak detection algorithms of TracMass 2 and XCMS: (a) leaf data set and (b) seed data set.

Manual inspection of these peaks showed that about half to two-thirds were real peaks, while one-third were classified as false positive detections (i.e., not peaks). Of the peaks unique to TracMass 2, all had a corresponding ROI, and for the XCMS-specific peaks, there were matching PICs for all but three of them.

The main difference between the algorithms is the number of filters. We believe that the main advantage of the five wavelet filters used by centWave compared to matchedFilter is the inclusion of one narrow and one wide filter, because we effectively found all the peaks found by centWave using only two filters. It could be argued that the use of more than two filters would increase the robustness, as the algorithm becomes less sensitive to the exact filter specification. However, this was not borne out by our results, as we found more real peaks with two filters. Use of a large number of filters can also accommodate a wide diversity in peak widths. However, our results suggest that two appropriately chosen filters are sufficient.

The evaluation of the peak detection step demonstrates the value of visual feedback in facilitating the optimization of parameter settings. The three peak detection algorithms compared here are fairly similar, yet gave very different results. Some of the peaks that matchedFilter missed can be explained by the fact that the algorithm only uses one filter and therefore is weak at detecting peaks with different widths. centWave missed over four hundred peaks in the leaf data set, most likely because of the parameters used. The parameters for matchedFilter and centWave were rigorously developed: around 50 different parameter settings were tested on data from a mixture of standards. The parameters for the peak detection step in TracMass 2 were set using one of the samples, relying on visual feedback to assess the quality of the results, and only nineteen peaks were missed in the leaf data set.

By using visual feedback, both the accuracy and the speed of setting the parameters can be increased. Any analytical chemist can identify a peak and change the parameters until they find peaks without getting false detections.


Visual diagnostic tools can also give the user a good idea of how the parameters should be changed. Another advantage of providing visual feedback is that the parameters can be tuned for one sample and validated using another sample. Thus, there is no longer a need to use samples of a standard mixture to develop parameters for more complex data, where there is a risk that the added complexity may affect the peak detection results negatively.

Alignment. Peak alignment is a four-step procedure in TracMass 2. The scheme, which includes clustering, retention time correction, a second clustering, and collision resolution, is similar to the alignment procedures of, for example, XCMS,2 MZmine,3 and MSFACTS.18 To validate the alignment results, we used the benchmark data sets of Lange et al.17 together with their evaluation script for computing recall and precision values based on a ground truth. TracMass 2 generated results that were comparable to those obtained with the tools tested by Lange (Table 2).

Table 2. Validation Results for Alignment on the Benchmark Data Sets by Lange et al.17

tool                             M1 recall   M1 precision   M2 recall   M2 precision
TracMass 2                       0.88        0.67           0.93        0.80
msInspect(a)                     0.27        0.46           0.23        0.47
MZmine(a)                        0.89        0.74           0.98        0.84
OpenMS(a)                        0.87        0.69           0.93        0.79
XAlign(a)                        0.88        0.70           0.93        0.79
XCMS without RT correction(a)    0.98        0.60           0.97        0.58
XCMS with RT correction(a)       0.94        0.70           0.98        0.78

(a) Data from Lange et al.17

In TracMass 2, visual feedback for the clustering step consists of a diagram where each cluster is represented by a different color and symbol. The user can zoom in and check whether the cluster structure makes sense. One can choose to process only a small m/z-time region for speed. Once these results are considered satisfactory, the entire data set can be clustered. The algorithmic complexity of the clustering is O(N log N).

The second step is warping, which has a single parameter: the number of splines that constitute the warping function. Here, feedback consists of a graphical display of the warping functions for every sample and a second display where the user chooses which sample to display (see Supporting Information Figure S-4). The second display shows the warping function together with the landmark peaks that were used to fit it. Samples are color coded.

The second clustering is identical to the first, with the exception that it is based on the warped retention times. This should give more compact clusters, which can be validated by visual inspection, and more landmark peaks.

The fourth step of our alignment procedure, utilizing the GFHT, was found to change the precision and recall values only in the third significant digit, which can be considered insignificant. However, use of the GFHT increased the number of landmark peaks by 4−24% in a number of data sets (Table 3). In the data set M2, the majority of the peak clusters were not landmark peaks, and therefore a large increase in the number of landmark peaks did not result in a significant increase in precision or recall. Cluster resolution with the GFHT is optional, and results can be exported before or after applying it. After GFHT alignment, results are exported with a degree of redundancy: the exported data include the unresolved as well as the GFHT-resolved clusters. This choice was made to safeguard against worst-case performance of the GFHT, since there is a small risk that a peak cluster is split erroneously. In the alignment evaluation, the data were exported without redundancy.

The GFHT alignment step has a theoretical advantage that has so far received little attention in the field of chromatographic alignment, namely the multisample advantage. Based on knowledge of how peaks shift in the chromatogram, one can easily resolve ambiguities where one or more samples have two peaks while the majority have one peak (Figure 3a). Eilers19 has shown that retention time shifts can be modeled using run order, using thirteen chromatograms to align an entire data set with 460 chromatograms. In NMR, we have shown that with multiple samples, peaks can be correctly aligned in situations that were previously thought impossible,13 for example, when peaks change order along the measurement axis. Changes of order are common in 1D-NMR data for metabolomics on biological fluids. The shift problems in LC are less complex than those in NMR, but there are situations where warping cannot correctly gather peaks into tight clusters. Peaks with different m/z may shift by different degrees, resulting in them changing positions along the time axis (Figure 3b). The implementation of the GFHT in TracMass 2 does not fully exploit the multisample advantage. Therefore, further research is needed to lessen the compromise between computational complexity and exploitation of the multisample advantage.

Feedback for the GFHT step is an interactive overview where all clusters resolved by the GFHT are displayed. Clicking on one such cluster displays a detailed plot with observed and predicted peak positions.

The objective of TracMass 2 is to process chromatography-MS data into a form fit for analysis using standard statistical methods. Because the statistical analysis is very dependent on the research question at hand, we have chosen to leave statistical analysis and identification out of TracMass 2. Tentative identities of unknown compounds can be found by manual queries to a mass spectrum database, for example, HMDB or Metlin, using the mass values extracted by TracMass 2.

Table 3. Data Sets Used for Validation and the Increase in the Number of Landmark Peaks Using the GFHT

data set       ref   increase in landmark peaks using GFHT (%)
orange juice   20    +4
M1             17    +4
M2             17    +24
seed           7     +8
leaf           7     +5


True positive identification requires additional experimental work with MS/MS and a reference standard.

The main rationale in the choice of the algorithms is that they should be simple, have parameters that can easily be tested with diagnostic tools, and be fast or usable on a subset of the data for efficient parametrization. They have been evaluated against other algorithms, not only to show that they have comparable performance, but also to show the importance of using good parameters. For accurate results, the settings used in an analysis method need to be optimized at every step, from sampling to data analysis, and to ensure reliable results, the analysis method needs to be validated at every step. Data analysis offers some unique opportunities here: it is possible to extract information from the data analysis process itself, which can be used to create diagnostic tools. If the full process is too lengthy, the data set can be divided and only a small part processed to test parameters. This can greatly increase both the speed and accuracy of the data analysis. TracMass 2 was developed with the aim of utilizing these advantages.

Figure 3. Examples of difficult alignment situations. (a) A cluster with one additional peak removed from the cluster by the GFHT (o, predicted peak positions; ×, observed peak positions; □, additional peak). (b) Two clusters of peaks at the same retention time but with different mass. Warping cannot align both to compact clusters. The samples in panel b have been sorted according to the retention time of the more variable cluster.

CONCLUSIONS
In untargeted analysis, the goal is to detect and acquire information about the concentration of as many analytes as possible, requiring a large amount of data to be processed. To ensure that the processing is done correctly, feedback is needed. Use of poor models or unsuitable parameters can degrade the processed data. In a sense, all data processing gives feedback via its end results, which may be adequate for statistical analysis but not for quality control. The optimization of parameters by repeated processing of an entire data set is generally time-consuming, and it can also be difficult to determine which parameters should be changed and how. Examination of the results from each individual processing step can reveal more about the parameters than looking at the final result. It may be difficult to discern, for example, how many peaks have been missed from tabulated data, but this can readily be seen in a chromatogram where detected peaks are marked. Good feedback allows the user to assess whether the results are reliable and, if the results are not satisfactory, helps the user correct the parameters. By visually exposing the inner components of an algorithm, the user can gain insight into the workings of the algorithm and how the parameters should be set to obtain the desired results. In this way, feedback can help improve the quality of the results of the data processing.

The implementation of immediate visual feedback in TracMass 2 helps the user to apply and understand advanced algorithms. It allows experimentalists with little chemometric background to utilize advanced chemometric tools. By processing a small subset of the data and inspecting the results manually, the user can be more confident in the results and set the parameters efficiently. We hope to set a new standard for the implementation of chemometric algorithms with immediate visual feedback at every step.

ASSOCIATED CONTENT
Supporting Information
Additional material as described in the text. This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.


ACKNOWLEDGMENTS
AstraZeneca is acknowledged for financial support during part of the algorithm development.

REFERENCES

(1) Lommen, A.; Kools, H. Metabolomics 2012, 8, 719−726.
(2) Smith, C.; Want, E.; O'Maille, G.; Abagyan, R.; Siuzdak, G. Anal. Chem. 2006, 78, 779−787. Tautenhahn, R.; Patti, G. J.; Rinehart, D.; Siuzdak, G. Anal. Chem. 2012, 84, 5035−5039.


(3) Pluskal, T.; Castillo, S.; Villar-Briones, A.; Oresic, M. BMC Bioinf. 2010, 11, No. 395.
(4) Katajamaa, M.; Orešič, M. J. Chromatogr. A 2007, 1158, 318−328.
(5) Ullsten, S.; Danielsson, R.; Bäckström, D.; Sjöberg, P.; Bergquist, J. J. Chromatogr. A 2006, 1117, 87−93.
(6) Jonsson, P.; Johansson, E. S.; Wuolikainen, A.; Lindberg, J.; Schuppe-Koistinen, I.; Kusano, M.; Sjöström, M.; Trygg, J.; Moritz, T.; Antti, H. J. Proteome Res. 2006, 5, 1407−1414. Shen, H.; Grung, B.; Kvalheim, O. M.; Eide, I. Anal. Chim. Acta 2001, 446, 311−326.
(7) Tautenhahn, R.; Bottcher, C.; Neumann, S. BMC Bioinf. 2008, 9, 504.
(8) Åberg, K. M.; Torgrip, R. J. O.; Kolmert, J.; Schuppe-Koistinen, I.; Lindberg, J. J. Chromatogr. A 2008, 1192, 139−146.
(9) Stolt, R.; Torgrip, R. J. O.; Lindberg, J.; Csenki, L.; Kolmert, J.; Schuppe-Koistinen, I.; Jacobsson, S. P. Anal. Chem. 2006, 78, 975−983.
(10) Zhang, J.; Gonzalez, E.; Hestilow, T.; Haskins, W.; Huang, Y. Curr. Genomics 2009, 10, 388−401.
(11) Danielsson, R.; Bylund, D.; Markides, K. E. Anal. Chim. Acta 2002, 454, 167−184.
(12) Bloemberg, T. G.; Gerretzen, J.; Lunshof, A.; Wehrens, R.; Buydens, L. M. C. Anal. Chim. Acta 2013, 781, 14−32. Prince, J. T.; Marcotte, E. M. Anal. Chem. 2006, 78, 6140−6152. Åberg, K. M.; Alm, E.; Torgrip, R. O. Anal. Bioanal. Chem. 2009, 394, 151−162.
(13) Csenki, L.; Alm, E.; Torgrip, R. O.; Åberg, K. M.; Nord, L.; Schuppe-Koistinen, I.; Lindberg, J. Anal. Bioanal. Chem. 2007, 389, 875−885.
(14) Alm, E.; Torgrip, R. O.; Åberg, K. M.; Schuppe-Koistinen, I.; Lindberg, J. Anal. Bioanal. Chem. 2009, 395, 213−223.
(15) Eilers, P. H. C.; Marx, B. D. Stat. Sci. 1996, 11, 89−102.
(16) Alm, E.; Slagbrand, T.; Åberg, K. M.; Wahlström, E.; Gustafsson, I.; Lindberg, J. Anal. Bioanal. Chem. 2012, 403, 443−455.
(17) Lange, E.; Tautenhahn, R.; Neumann, S.; Gröpl, C. BMC Bioinf. 2008, 9, No. 375.
(18) Duran, A. L.; Yang, J.; Wang, L.; Sumner, L. W. BMC Bioinf. 2003, 19, 2283−2293.
(19) Eilers, P. H. C. Anal. Chem. 2003, 76, 404−411.
(20) Tengstrand, E.; Rosén, J.; Hellenäs, K.-E.; Åberg, K. M. Anal. Bioanal. Chem. 2013, 405, 1237−1243.
