
Article

A Web Server for Peak Detection, Baseline Correction and Alignment in Two-dimensional Gas Chromatography Mass Spectrometry-based Metabolomics Data

Tze-Feng Tian, San-Yuan Wang, Tien-Chueh Kuo, Cheng-En Tan, Guan-Yuan Chen, Ching-Hua Kuo, Chi-Hsin Chen, Chang-Chuan Chan, Olivia A. Lin, and Yufeng Jane Tseng

Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.6b00755 • Publication Date (Web): 27 Sep 2016

Downloaded from http://pubs.acs.org on September 28, 2016

Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Analytical Chemistry

ACS Paragon Plus Environment

A Web Server for Peak Detection, Baseline Correction and Alignment in Two-dimensional Gas Chromatography Mass Spectrometry-based Metabolomics Data

Tze-Feng Tian†,‡, San-Yuan Wang†,‡, Tien-Chueh Kuo‡,⊥, Cheng-En Tan†,‡, Guan-Yuan Chen‡,§, Ching-Hua Kuo‡,§,∥, Chi-Hsin Sally Chen#, Chang-Chuan Chan#, Olivia A. Lin⊥, and Y. Jane Tseng*,†,‡,§,⊥

† Department of Computer Science and Information Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei 10617, Taiwan

‡ The Metabolomics Core Laboratory, Center of Genomic Medicine, National Taiwan University, No. 2, Syu-Jhou Rd., Taipei 10055, Taiwan

§ School of Pharmacy, College of Medicine, National Taiwan University, No. 33, Linsen S. Rd., Taipei 100, Taiwan

∥ Department of Pharmacy, National Taiwan University Hospital, National Taiwan University, No. 1, Changde St., Taipei 10048, Taiwan

⊥ Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei 10617, Taiwan

# Institute of Occupational Medicine and Industrial Hygiene, National Taiwan University, No. 17, Xuzhou Rd., Taipei 100, Taiwan

National Taiwan University, Taipei, Taiwan

* Corresponding Author. Voice: +886.2.3366.4888#529; Fax: +886.2.23628167; E-mail: [email protected]

ABSTRACT

Two-dimensional gas chromatography time-of-flight mass spectrometry (GCxGC/TOF-MS) is superior for chromatographic separation and provides great sensitivity for complex biological fluid analysis in metabolomics. However, GCxGC/TOF-MS data processing is currently limited to vendor software and typically requires several preprocessing steps. In this work, we implement a web-based platform, which we call GC2MS, to facilitate the application of recent advances in GCxGC/TOF-MS, especially for metabolomics studies. The core processing workflow of GC2MS consists of blob/peak detection, baseline correction, and blob alignment. GC2MS treats GCxGC/TOF-MS data as pictures and clusters the pixels as blobs according to the brightness of each pixel to generate a blob table. GC2MS then aligns the blobs of two GCxGC/TOF-MS datasets according to their distance and similarity. The blob distance and similarity are the Euclidean distance of the first and second retention times of two blobs and the Pearson’s correlation coefficient of the two mass spectra, respectively. GC2MS also directly corrects the raw data baseline. The analytical performance of GC2MS was evaluated using GCxGC/TOF-MS datasets of Angelica sinensis compounds acquired under different experimental conditions and of human plasma samples. The results show that GC2MS is an easy-to-use tool for detecting peaks and correcting baselines, and GC2MS is able to align GCxGC/TOF-MS datasets acquired under different experimental conditions. GC2MS is freely accessible at http://gc2ms.web.cmdm.tw.

KEYWORDS Two-Dimensional Gas Chromatography Time-of-Flight Mass Spectrometry, Alignment, GC2MS

INTRODUCTION

Two-dimensional gas chromatography time-of-flight mass spectrometry (GCxGC/TOF-MS) is an analytical tool that is well-suited for semi-volatile and volatile analyte evaluation due to its enhanced separation capacity and combination with mass spectrometry (MS).1 The enhanced chemical selectivity, chemical sensitivity, separation capacity, dynamic range, and signal-to-noise ratio are beneficial for the analysis of the complex contents of microorganisms2-6, mammals7-11, plants12-15, and other metabolite samples. For example, Whitener et al. used GCxGC/TOF-MS to profile Sauvignon Blanc co-fermented with different yeasts16, and Sweetman et al. used GCxGC/TOF-MS to analyze organic acids in urine17. Shi et al. reported several advantages of using GCxGC/TOF-MS to analyze complex samples, including increased separation capacity, better signal-to-noise ratios, and improved dynamic range.18 Compared with one-dimensional GC-MS, improved resolution and an order-of-magnitude increase in peak capacity are achievable,19 and the combination with MS widens the dynamic range and heightens the sensitivity relative to a conventional one-dimensional gas chromatography system.

The overall workflow of GCxGC/TOF-MS data analysis is as follows: (1) acquiring and configuring data for storage, access, and evaluation; (2) processing data to remove unwanted artifacts and peaks; (3) identifying the chemical constituents; and (4) analyzing datasets for higher-level information and reporting.20 Currently, there is no free web-based platform for processing two-dimensional gas chromatography data. The outcomes obtained using chemometric methods depend on the reproducibility of the GCxGC/TOF-MS data, which contributes greatly to the explanatory analysis of complex biological systems in metabolomics7, 21-22 and proteomics23-24 applications.

Ideally, comprehensive GCxGC/TOF-MS can provide reproducible separation datasets,25 although a retention time (RT) shift is always observed among GCxGC/TOF-MS data. This RT shift might be caused by pressure and temperature fluctuations, stationary phase degradation, and matrix effects. Comparing metabolic profiles from chromatographic runs with RT shifts is challenging. Hence, metabolite peak alignment is an important task and is our goal in the preprocessing of GCxGC/TOF-MS data.

Metabolite peak alignment in mass-based platforms can be categorized into two general approaches. One approach is to adjust the RT of the raw data to a common RT. Fraga et al. used the generalized rank annihilation method (GRAM) with a rank-based algorithm to adjust the RT shift in two-dimensional gas chromatography.26 Mispelaar et al. aligned a local region of the GCxGC chromatograms with a correlation-optimized shifting-based algorithm.27 These two methods require internal standards, and their common weakness is that they focus only on small regions of interest. Pierce et al. reported a two-dimensional RT alignment method using an indexing scheme.28 Zhang et al. extended the correlation-optimized warping methods from a one-dimensional gas chromatography system to a two-dimensional gas chromatography system.19 The methods reported by Pierce et al. and Zhang et al. align the data based on the two-dimensional RT alone, without considering the mass spectrum of the fragmented ions. Thus, both methods are prone to incorrect alignment.

The other alignment approach is to align the detected metabolite peaks of all the samples. Oh et al. developed MSort by considering both the similarity of the mass spectra and the differences in the RTs of the detected peaks.29 Wang et al. improved MSort by using an algorithm to optimize the alignment of the distance and spectrum correlation.30 These two methods provided a better false-positive rate for the alignment. Kim et al. further advanced MSort and designed an alignment mechanism for homogeneous data31 when the compound RT shifts are not large.

To correctly compare metabolic profiles, we developed GC2MS, a web-based platform used to process data generated by GCxGC/TOF-MS. Before aligning the GCxGC/TOF-MS data, GC2MS corrects the baseline and detects the metabolite peaks of the corrected GCxGC/TOF-MS data. Subsequently, GC2MS aligns the detected metabolite peaks according to the distance of the peaks in the time scale and the similarity of the peak spectra of the GCxGC/TOF-MS data. The performance of GC2MS was evaluated using existing datasets of Angelica sinensis acquired under different experimental conditions and of a GCxGC/TOF-MS-based metabolomic study of 13 breast cancer subjects and 12 healthy volunteers. We also compared the performance of two GC2MS alignment algorithms applied to the GCxGC/TOF-MS datasets and showed that the two algorithms are useful in different experimental environments.

METHODS

Overview of the main algorithm procedures. GC2MS is a fully functional, web-based GCxGC/TOF-MS data analysis platform (gc2ms.web.cmdm.tw) implemented with Ruby (version 2.1.2p95) on Rails (version 4.1.6) (http://rubyonrails.org). GC2MS incorporates baseline correction, peak detection, and alignment. A snapshot of the GC2MS homepage is shown in Figure 1. Users can upload the netCDF data generated from the instrument to GC2MS. Baseline correction in GC2MS utilizes a six-step stride-pixel detection method to adjust the baseline for accurate quantification and peak integration. After baseline correction, GC2MS can identify clusters of brighter or darker blob pixels in the chromatogram of each GCxGC/TOF-MS raw dataset and align the blobs by the Euclidean distance of the selected blobs and the similarity of their mass spectra. The Euclidean distance is computed from the first and second RTs of the two blobs, and the similarity is calculated as the Pearson's correlation coefficient of the two mass spectra. This algorithm uses blob tables to perform alignment, in contrast to other methods that use peak tables. The GC2MS alignment algorithm is advantageous for many reasons; most notably, it can be applied to GCxGC/TOF-MS data produced under inconsistent instrument environments (e.g., fluctuations in pressure and temperature, degradation of the stationary phase, and matrix effects) and can adjust the shifts in the RT along the two chromatographic dimensions. After aligning the uploaded data, GC2MS automatically sends an e-mail to notify users to download the aligned results. The overall workflow of GC2MS is illustrated in Figure 2.
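The two quantities used for alignment can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the GC2MS code: the dictionary keys rt1 and rt2 and the function names rt_distance and pearson are hypothetical, and the two spectra are assumed to be intensity vectors binned on the same m/z axis.

```python
import math


def rt_distance(blob_a, blob_b):
    """Euclidean distance between two blobs in the (RT1, RT2) plane."""
    return math.hypot(blob_a["rt1"] - blob_b["rt1"],
                      blob_a["rt2"] - blob_b["rt2"])


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length intensity vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("spectra must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat spectrum is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)
```

Because RT1 and RT2 usually span very different ranges, an implementation may need to rescale them before computing the distance; the text does not state whether GC2MS does so.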

Figure 1. Snapshot of the GC2MS homepage.

Figure 2. Overall workflow of GC2MS: (a) netCDF data loading, (b) baseline adjustment, (c) peak detection from the baseline-corrected object and preliminary peak table generation, and (d) peak table alignment and final peak table generation.

Raw data preprocessing. LECO ChromaTOF software (LECO Corp., St. Joseph, MI) is used for instrument control and spectrum deconvolution. The parameters for the instrument are described in the supplementary information. Each raw instrument dataset is converted into the netCDF file format for further processing in GC2MS. Data parsing is performed via the statistical programming language R32. The raw data in netCDF format can be divided into several crucial sections based on factors such as the total intensity, scan index, scan acquisition time, mass-intensity pair value, point count in each scan and other parameters generated by the instrument. The total intensity and scan acquisition time are stored as sequential patterns that require conversion into a two-dimensional matrix structure before further analysis. The function readRaw in GC2MS takes the RT and intensity in the netCDF files as the input and generates an m × n matrix, with n indicating the number of scans in each column, where the run time is n seconds with m scans.

Baseline correction. To accurately quantify the analyte peaks and provide more accurate peak integration, it is necessary to subtract the baseline from the signal. GC2MS first applies a smoothing function to each stride and then estimates the local mean (μ) and standard deviation (σ). The estimated background level of each stride is calculated using equation 1.

$$\mu + 1.96 \times \sigma \qquad (1)$$
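Equation 1 can be sketched in a few lines of Python. The smoothing function is not specified in the text, so a centered moving average is assumed here purely for illustration; the function names, the window size, and the use of the population standard deviation are likewise assumptions rather than details of GC2MS.

```python
import statistics


def moving_average(values, window=3):
    """Centered moving average; near the edges only available samples are used."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def background_level(stride, window=3):
    """Estimated background level of one stride: mu + 1.96 * sigma (equation 1)."""
    smoothed = moving_average(stride, window)
    mu = statistics.fmean(smoothed)
    sigma = statistics.pstdev(smoothed)  # assumption: population standard deviation
    return mu + 1.96 * sigma
```

Under this reading, intensities in a stride that stay below the returned level would be treated as background, and intensities above it as candidate signal.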

Then, GC2MS adopts the μ, σ of Reichenbach et al.33 to comprehensively extract the GCxGC baseline by estimating the baseline level with the chromatographic information and the properties of the GCxGC data.

Blob detection. Before blob detection, data table X is generated by summing the intensities of every ion in the spectra. X_{i,j} is the total intensity of the spectrum whose RT1 and RT2 are i and j. A blob is a cluster of pixels that are brighter than their surroundings. GC2MS treats X as a picture and detects the bright parts in the picture. The goal of blob detection is to find peaks or signals generated by an analyte (which can be a specific analyte, such as an internal standard) in a GCxGC chromatogram. In blob detection, clusters of pixels containing peaks are aggregated. A detected blob might be formed from several co-eluted analyte peaks; conversely, a single analyte peak might be detected incorrectly as several blobs. Blob detection is performed simultaneously in both dimensions, as follows:

(1) Sort the pixels by decreasing intensity and denote the pixel vector as P.

(2) Start from the top pixel p_1 and assign it blob number 1.

(3) For each pixel p_i in P, if the position of p_i is adjacent to p_{i-1}, p_i is assigned the blob number of p_{i-1}. Otherwise, the blob number is increased by 1 and assigned to p_i.

(4) Continue the loop until no more pixels remain in P.

(5) Select the highest-intensity peak of each blob and denote the blob vector as B.

(6) Match each blob b_i in B with the other blobs b_k in B. If the Pearson's correlation coefficient of b_i and b_k is larger than 0.95, the blob number of b_k is updated to the blob number of b_i.

(7) Continue the loop until no more blobs remain in B.

Alignment of the retention-time shift. The alignment functions in GC2MS are designed for two commonly observed scenarios: samples acquired from a consistent instrument environment (smaller RT shift) and samples acquired from an inconsistent instrument environment (larger RT shift), for instance, different temperature gradients. We assume that samples acquired from a consistent instrument environment possess smaller RT shifts than those acquired from an inconsistent instrument environment due to uncontrollable experimental factors. For each sample, GC2MS aligns all metabolite blobs of all replicate injections into a single blob table. GC2MS generates this blob table according to the method of Wang et al.30

For a smaller RT shift situation (method I, GC2MS-1 alignment algorithm), GC2MS computes the Euclidean distance and the Pearson's correlation coefficient of the mass spectra between the reference blob and all non-reference blobs. If the Euclidean distance is shorter than a user-specified distance and the Pearson's correlation coefficient is greater than the threshold (default is 0.95), the non-reference blob with the minimum Euclidean distance is aligned to the reference blob.

For a larger RT shift situation (method II, GC2MS-2 alignment algorithm), GC2MS directly computes the Pearson's correlation coefficient of the spectra between the reference blob and the non-reference blobs, without calculating the Euclidean distance between them. If the correlation coefficient is greater than 0.95, then the non-reference blob with the highest coefficient is selected as the matched blob.
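The difference between the two matching rules can be made concrete with a small Python sketch. It is a simplified reading of the description above, not the GC2MS implementation: the blob dictionaries (keys rt1, rt2, spectrum) and the function names are hypothetical, and spectra are assumed to share one m/z axis.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rt_distance(a, b):
    """Euclidean distance between two blobs in the (RT1, RT2) plane."""
    return math.hypot(a["rt1"] - b["rt1"], a["rt2"] - b["rt2"])


def align_method_1(reference, candidates, max_distance, min_corr=0.95):
    """Method I: nearest candidate that is both close enough and similar enough."""
    eligible = [
        c for c in candidates
        if rt_distance(reference, c) < max_distance
        and pearson(reference["spectrum"], c["spectrum"]) > min_corr
    ]
    return min(eligible, key=lambda c: rt_distance(reference, c), default=None)


def align_method_2(reference, candidates, min_corr=0.95):
    """Method II: most similar spectrum above the threshold, ignoring RT distance."""
    scored = [(pearson(reference["spectrum"], c["spectrum"]), c) for c in candidates]
    eligible = [(r, c) for r, c in scored if r > min_corr]
    if not eligible:
        return None
    return max(eligible, key=lambda rc: rc[0])[1]
```

In this sketch, method I never matches a blob that lies beyond max_distance, however similar its spectrum, whereas method II can match a strongly shifted blob but relies entirely on spectral similarity.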

EXPERIMENTAL SECTION

Preparation of Angelica sinensis sample extract. Angelica sinensis (Danggui) samples were obtained from a local market. Angelica sinensis powder was extracted by a polydimethylsiloxane solid-phase microextraction (SPME) fiber (Supelco, Bellefonte, IL) for 30 min, followed by desorption of the fiber in the GC injector port at 250°C for 1 min in splitless mode. The SPME extract of Angelica sinensis was run on the GCxGC/TOF-MS 5 times using three temperature gradients (3, 5 and 7°C/min) to introduce substantial shifting.

Preparation of the human plasma sample extract. Twenty-five plasma samples and a pooled quality control (QC) sample were extracted with 400 µl of methanol and dried using a SpeedVac (Tokyo Rikakikai, Tokyo, Japan). The resulting residue was derivatized using methoxyamine hydrochloride (40 mg/ml) and a derivatization agent (MSTFA+TMCS, 99:1).

GCxGC/TOF-MS analysis. The GCxGC/TOF-MS analyses for this study were performed using a LECO Pegasus 4D time-of-flight mass spectrometer (Leco Corporation, St. Joseph, MI, USA). The Pegasus 4D mass spectrometer was equipped with an Agilent 7890A gas chromatograph connected to a LECO two-stage cryogenic modulator and a secondary oven. The first-dimension chromatographic column was a 30 m DB-5MS capillary column (5% phenyl, 95% dimethylpolysiloxane) with an internal diameter of 0.25 mm (Agilent Technologies, Santa Clara, CA). The second-dimension chromatographic column was a 1 m RXI-17 capillary column (50% diphenyl, 50% dimethylpolysiloxane) with an internal diameter of 0.1 mm (Bellefonte, PA, USA).

For the Angelica sinensis samples, the first-dimension column was set to an initial oven temperature of 80°C, which was then increased to 250°C at rates of 3, 5, and 7°C/min. The initial oven temperature of the second-dimension column was 100°C, 20°C higher than that of the first-dimension column, and the temperature was gradually increased at the same rates as described above. The other parameters were as follows: fixed MS mass range of 40-500 m/z, acquisition rate of 130 spectra/sec, and helium carrier gas flow rate of 1 ml/min.

For the human plasma samples, the first-dimension column was set to an initial oven temperature of 50°C for 5 min, increased to 120°C at a rate of 10°C/min, and then increased to 295°C at a rate of 15°C/min. The final oven temperature was held for 5 min. The second-dimension column was set to an initial oven temperature of 60°C for 5 min, increased to 130°C at a rate of 10°C/min, and then increased to 305°C at a rate of 15°C/min. The final oven temperature was held for 5 min. The other parameters were as follows: fixed MS mass range of 85-700 m/z, acquisition rate of 120 spectra/sec, and helium carrier gas flow rate of 1 ml/min.

RESULTS AND DISCUSSION

GC2MS has the advantage of directly accepting raw netCDF data generated from a GCxGC/TOF-MS instrument. Using raw data helps to preserve more of the potentially important features of each sample. To demonstrate how GC2MS resolves RT shift issues, including those arising from consistent or inconsistent instrument environments, the data for the analysis include measurements made using an identical GCxGC configuration at three distinct column temperature ramps: 3, 5, and 7°C/min. The experiments at the three temperature ramps were repeated three times, that is, three replicate injections at each of 3, 5 and 7°C/min, to objectively and accurately assess the performance of GC2MS. Each experimental result was denoted Str, where the first digit t is the temperature ramp and the second digit r indicates the experimental run. For instance, S32 denotes the second experimental run at 3°C/min.

Baseline of Angelica sinensis samples. Baseline correction is an optional preprocessing step to accurately quantify peaks in GC2MS. In GCxGC data, a flat baseline is usually observed at many points, especially during the void time of each second-column separation. Thus, GC2MS utilizes this attribute of GCxGC to adjust the baseline for accurate quantification and peak integration. Generally, the baseline does not change significantly over the brief time of a few modulation cycles; these observations are used to reconstruct the baseline in a comprehensive fashion. Figure S-1 shows the contour plots of the samples with three replicates acquired with temperature ramps of 3°C/min, 5°C/min and 7°C/min. Before baseline correction, there are unstable baselines along the y-axis for 2500 seconds of RT1 in S31, S32 and S33. In the second column of Figure S-1, S51, S52 and S53 display unstable baselines for 2000 seconds of RT. The unstable band was subtracted after baseline correction, as shown in Figure 3. Figure S-2 shows the total intensity and baseline of the RT2-intensity chromatograms in S52. The sixteen chromatograms were randomly selected from the 668 RT2-intensity chromatograms in S52 (the number on the left side of each slash indicates the position of the chromatogram along RT1). The red line in the figure indicates the baseline of two strides (two strides equal one column of RT2). The baseline is constructed based on the statistical properties of each stride. The top left (445/668) of Figure S-2 shows that the baseline is constructed along the second RT because the computed vectors are identified as background noise. However, according to equation 3, some peaks with higher slopes would be identified as signals even though the adjacent peaks are baselines, as shown in the top left (445/668) and the bottom left (449/668) of Figure S-2.
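The RT2-intensity chromatograms and strides discussed above presuppose that the sequential scan data have been folded into columns. A minimal Python sketch of that bookkeeping follows; it assumes a fixed number of scans per RT2 column and equal-length strides, and the function names fold_scans and split_strides are hypothetical, not part of GC2MS.

```python
def fold_scans(total_intensity, scans_per_column):
    """Fold a 1D sequence of scan intensities into a list of RT2 columns.

    Assumption: every column holds the same number of scans; an incomplete
    trailing column is dropped.
    """
    if scans_per_column <= 0:
        raise ValueError("scans_per_column must be positive")
    n_columns = len(total_intensity) // scans_per_column
    return [
        total_intensity[k * scans_per_column:(k + 1) * scans_per_column]
        for k in range(n_columns)
    ]


def split_strides(column, strides_per_column=2):
    """Split one RT2 column into equal strides (two strides per column in the text).

    Any leftover scans at the end of the column are ignored.
    """
    size = len(column) // strides_per_column
    return [column[k * size:(k + 1) * size] for k in range(strides_per_column)]
```

For example, fold_scans(list(range(7)), 3) yields two columns of three scans each, and each column can then be split into strides whose statistics feed the baseline estimate.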

Figure 3. Comprehensive GCxGC/TOF-MS contour plots of the total ion chromatograms after baseline correction with three replicates acquired with temperature gradients of 3°C/min, 5°C/min and 7°C/min.

Results of blob detection. Blob detection is the key step in data preprocessing before alignment because the accuracy of the alignment process is highly dependent on the output of the blob detection. The proposed blob detection algorithm is a "greedy" dilation algorithm because it detects blobs from the largest intensity to the lowest intensity. One way to accelerate blob detection is to apply an intensity threshold, such that only peaks with an intensity larger than 1,000,000 are detected. Filtration is useful because most of the peaks in GCxGC are significantly higher than the background noise (i.e., high S/N ratio). When all the blobs are detected, GC2MS selects the highest-intensity blob and uses the raw ion fragment spectrum to merge highly correlated blobs (default correlation coefficient threshold is 0.95). Blob merging is greatly affected by the fragment spectra information. Two blobs are merged if they have high similarity. The more fragment spectra features that are provided, the lower the probability that two different metabolite blobs will be incorrectly merged. Because the fragment spectra information is completely conserved, the accuracy of this blob merging approach can be high. Figure 4 shows the results of blob detection following baseline correction. The red circles indicate the highest-intensity blobs of each metabolite in the nine samples.
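A greedy, intensity-ordered labelling with an intensity threshold can be sketched in Python as follows. This is an interpretation rather than the GC2MS code: it assumes that a pixel joins a blob when it is 8-connected to any pixel that has already been labelled (the Methods text describes adjacency to the previous pixel in the sorted list), it omits the spectrum-based merging step, and the function names are hypothetical.

```python
def detect_blobs(image, threshold=0.0):
    """Greedy labelling of a 2D intensity grid, brightest pixel first.

    Returns a dict mapping (row, col) to a 1-based blob number.
    Pixels at or below `threshold` are never labelled.
    """
    pixels = [
        (value, r, c)
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if value > threshold
    ]
    pixels.sort(key=lambda p: p[0], reverse=True)  # decreasing intensity

    labels = {}
    next_label = 1
    for _, r, c in pixels:
        # Already-labelled 8-connected neighbours: (blob number, intensity)
        neighbours = [
            (labels[(r + dr, c + dc)], image[r + dr][c + dc])
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr or dc) and (r + dr, c + dc) in labels
        ]
        if neighbours:
            # Join the blob of the brightest labelled neighbour
            labels[(r, c)] = max(neighbours, key=lambda n: n[1])[0]
        else:
            labels[(r, c)] = next_label
            next_label += 1
    return labels


def blob_apexes(image, labels):
    """Highest-intensity pixel of each blob: blob number -> (row, col)."""
    apex = {}
    for (r, c), label in labels.items():
        if label not in apex or image[r][c] > image[apex[label][0]][apex[label][1]]:
            apex[label] = (r, c)
    return apex
```

Raising the threshold shrinks the list that has to be sorted and scanned, which is the speed-up described above; the apex pixels returned by blob_apexes correspond to the highest-intensity points that would then be compared by spectrum for merging.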

Figure 4. Comprehensive GCxGC/TOF-MS contour plots of the total ion chromatograms after baseline correction and blob detection (red circles) with three replicates acquired with temperature gradients of 3°C/min, 5°C/min and 7°C/min.

Comparison of two alignment methods. For a consistent instrument environment setting, GC2MS aligns the peaks based on the spectrum and RT of the reference peaks. If a target sample peak is present in the reference peak list, its two-dimensional RT is corrected to reflect the same values as the corresponding reference peak. The first and second RT of the reference peak are adjusted using equation 1 and equation 2 in Figure 5. In this figure, the results of the alignment of the samples of S5 are shown (S3 and S7 are shown in Figure S-3). In S3, all the blobs are assumed to have a consistent instrument environment. The red circles, green circles and blue circles indicate the samples of the replicates of S31, S32 and S33, respectively. The left figures show the distributions of the blobs before alignment, and the aligned blobs are shown in the right figures. According to equation 1 and equation 2, the reference blob is stretched and compressed by homogeneous blobs, and all the homogeneous blobs are aligned to the reference blob. Therefore, only the blue circles are shown because the red circles and green circles have the same RTs as the blue circles. The GC2MS-1 alignment algorithm can solve the RT shift problem in a homogeneous instrumental environment. However, its performance is affected by large RT variation, such as in a heterogeneous instrumental environment.

We use the GC2MS-1 and GC2MS-2 alignment algorithms to analyze the same datasets and to calculate the average Euclidean distance between the blobs under homogeneous and heterogeneous instrumental environments, respectively. Figure 6a,b shows the alignment results of 23 blobs (indicated by different colors) of 9 samples with different temperature gradients. Circles of the same color indicate metabolites with high spectral similarity (Pearson's correlation coefficient larger than 0.95). In Figure 6a, blobs with the same temperature gradient tend to have a smaller distance, whereas blobs from different temperature gradients tend to shift from the bottom left to the top right or vice versa, for instance, the light green blobs displayed as three "triple-blob-groups" on the left side of Figure 6a,b. The "triple-blob-groups" have three blobs in each group, and they are very close to each other. After GC2MS-1 alignment, the RTs of two "triple-blob-groups" are adjusted and aligned. Only four blobs are shown after the alignment by GC2MS-1. One "triple-blob-group" is not aligned because the closest blobs to the aligned "triple-blob-group" are not light green. This case changes when the chromatogram is aligned by GC2MS-2, which is able to find the blobs with the highest Pearson's correlation coefficient even though they are shifted far from their sources. As a result, the light green blobs are aligned together (they overlap each other with the same RT).

The average Euclidean distances in both cases are shown in Figure 6c. The number in each column represents the number of blobs (a total of 23 blobs were aligned), and the rows of the table represent the average Euclidean distance obtained by GC2MS-1 and GC2MS-2. Throughout the comparison, GC2MS-2 shows better performance than GC2MS-1, likely because the pattern of shifting is not linear for all blobs, that is, the variation of RT1 and RT2 of each blob is not directly proportional to the temperature gradient (as shown in Figure 6a,b). The GC2MS-1 alignment algorithm, which uses the distance measure of homogeneous blob determination as the means of matching, might have a high false-positive rate in such a situation because the distance between adjacent blobs is smaller than that of the original shifted blob. Figure 6b shows the scatter plots of the blobs detected in different samples acquired from different temperature gradients using the GC2MS-2 alignment algorithm. Before alignment, the RT shift of each blob varies greatly with the temperature gradient. According to equation 2 and equation 3, the blobs are warped with respect to the RT and intensity of each blob. The average Euclidean distance of each peak is shown in Figure 6c. These results show the effectiveness of the GC2MS-2 alignment algorithm for heterogeneous datasets (with different temperature gradients).
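The text does not define the average Euclidean distance precisely. One plausible reading, assumed here only for illustration, is the mean pairwise distance between the (RT1, RT2) positions of the blobs matched to the same metabolite across samples; the function name below is hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all pairs of (rt1, rt2) points.

    Returns 0.0 when fewer than two points are given.
    """
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)
```

Under this reading, a perfectly aligned group, in which every matched blob shares the same RT1 and RT2, has an average distance of zero, and larger values indicate residual shift after alignment.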

Figure 5. The scatter plots between the first and second RT are presented after aligning 3 chromatograms for the dataset with a temperature gradient of 5°C/min.

Figure 6. The scatter plots between the first and second RTs are presented after aligning 9 chromatograms for each dataset with temperature gradients of 3, 5 and 7°C/min using (a) GC2MS-1 and (b) GC2MS-2. (c) Average Euclidean distance of each peak after alignment using the two approaches.

Metabolomics of human breast cancer. Twenty-eight GCxGC/TOF-MS datasets, comprising twenty-five human plasma samples and a QC sample with three technical replicates, are analyzed by GC2MS. The three QC replicates are injected at the 1st, 14th, and 28th injections, and the second QC replicate is selected as the reference sample. The 28 GCxGC/TOF-MS datasets (netCDF format) are uploaded to the GC2MS server for baseline correction, blob detection, and alignment. When processing finishes, GC2MS sends the user a notification e-mail, and the results can then be downloaded from the GC2MS server. The downloaded files include a peak table and two tables containing the retention times of each blob in each sample. In this breast cancer study, 190 blobs are detected by GC2MS, 164 of which are detected in two or three QC replicates. These 164 blobs are used to perform partial least squares-discriminant analysis (PLS-DA), and the PLS-DA score plot is shown in Figure 7. In the PLS-DA score plot, the breast cancer and healthy samples are clearly separated. The levels of 5 blobs differ significantly between breast cancer subjects and healthy volunteers (t-test p value less than 0.05). To identify the metabolites, the spectra of these 5 blobs in the second QC replicate are compared to the LECO/Fiehn metabolomics library (the spectral similarity calculation used for identification is described in the SI). Three blobs with significant differences are identified, and their p values, similarity scores, and retention times are shown in Table S-1. The intensity levels of the 3 metabolites are shown in Figure 8.
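The two filtering steps in this workflow, keeping blobs detected in at least two of the three QC replicates and then comparing blob levels between groups, can be sketched as follows. This is a minimal illustration rather than the GC2MS code: the function names and intensity values are hypothetical, a blob is assumed to count as "detected" when its intensity is nonzero, and the text does not state which t-test variant was used, so the pooled-variance Student's t statistic is an assumption. Converting the statistic to a p value requires the t distribution with n_a + n_b - 2 degrees of freedom, which a statistics package would normally supply.

```python
import math
import statistics

def detected_in_qc(qc_intensities, min_replicates=2):
    """True if the blob has a nonzero intensity in at least min_replicates QC runs."""
    return sum(1 for v in qc_intensities if v > 0) >= min_replicates

def student_t(group_a, group_b):
    """Two-sample Student's t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * statistics.variance(group_a)
              + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

# Hypothetical intensities of three blobs in the three QC replicates.
qc_table = {
    "blob_1": [1200.0, 1150.0, 1300.0],  # detected in all three replicates
    "blob_2": [0.0, 0.0, 410.0],         # detected in one replicate only
    "blob_3": [0.0, 980.0, 1010.0],      # detected in two replicates
}
kept = [name for name, qc in qc_table.items() if detected_in_qc(qc)]
print(kept)  # ['blob_1', 'blob_3']

# Hypothetical levels of one retained blob in the two subject groups.
cancer = [5.1, 5.6, 6.0, 5.8]
healthy = [4.2, 4.0, 4.6, 4.4]
print(student_t(cancer, healthy))
```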

Figure 7. PLS-DA score plot distinguishing breast cancer subjects from healthy volunteers. The PLS-DA was conducted based on 164 detected blobs. The results with three components can separate breast cancer subjects (black) from healthy volunteers (red).

Figure 8. Box plots of the significant metabolites.

CONCLUSIONS

A free web-based platform, GC2MS, for GCxGC/TOF-MS data analysis is proposed and developed. It directly takes netCDF data obtained from an instrument as input, detects blobs in each GCxGC/TOF-MS run, and aligns the blobs using the Euclidean distance and Pearson’s correlation coefficient. GC2MS is a fully functional GCxGC/TOF-MS-based data analysis platform that incorporates baseline correction, peak detection, alignment and visualization. The effectiveness and performance of the GC2MS alignment algorithm were evaluated using existing datasets acquired under different experimental conditions, and the algorithm was reliable and useful under these different experimental conditions.

ACKNOWLEDGMENTS

This work was funded by the Ministry of Science and Technology, Taiwan, grant numbers 103-2325-B002-048-, 104-2321-B-002-037-, and 104-2325-B-400-014-. The resources of the Laboratory of Computational Molecular Design and Detection, Department of Computer Science and Information Engineering, National Taiwan University were used to perform these studies.
