
ADAP-GC 3.2: Graphical Software Tool for Efficient Spectral Deconvolution of Gas Chromatography-High Resolution Mass Spectrometry Metabolomics Data Aleksandr Smirnov, Wei Jia, Douglas I. Walker, Dean P. Jones, and Xiuxia Du J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00633 • Publication Date (Web): 27 Oct 2017 Downloaded from http://pubs.acs.org on October 28, 2017


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


ADAP-GC 3.2: Graphical Software Tool for Efficient Spectral Deconvolution of Gas Chromatography-High Resolution Mass Spectrometry Metabolomics Data

Aleksandr Smirnov,† Wei Jia,‡ Douglas I. Walker,¶ Dean P. Jones,¶ and Xiuxia Du∗,†

†University of North Carolina at Charlotte, Charlotte, North Carolina 28223, United States
‡University of Hawaii Cancer Center, Honolulu, Hawaii 96813, United States
¶Emory University, Atlanta, Georgia 30322, United States

E-mail: [email protected]
Phone: (704) 687-7307
9201 University City Blvd., Charlotte, NC 28223, USA


Abstract

ADAP-GC is an automated computational workflow for extracting metabolite information from raw, untargeted gas chromatography mass spectrometry metabolomics data. Deconvolution of co-eluting analytes is a critical step in the workflow, and the underlying algorithm is able to extract fragmentation mass spectra of co-eluting analytes with high accuracy. However, the previous release, ADAP-GC 3.0, was not user-friendly. To make ADAP-GC easier to use, we have developed ADAP-GC 3.2 and describe here improvements in three aspects. First, all of the algorithms in ADAP-GC 3.0 written in R have been replaced by their analogues in Java and incorporated into MZmine 2 to make the workflow user-friendly. Second, the clustering algorithm DBSCAN has replaced the original hierarchical clustering to allow faster spectral deconvolution. Finally, algorithms originally developed for constructing extracted ion chromatograms (EICs) and detecting EIC peaks from LC/MS data are incorporated into the ADAP-GC workflow, allowing it to process high mass resolution data. The performance of ADAP-GC 3.2 has been evaluated using unit mass resolution data from standard-mixture and urine samples. The identification and quantitation results were compared to those produced by ADAP-GC 3.0, AMDIS, AnalyzerPro, and ChromaTOF. Identification results for high mass resolution data derived from standard-mixture samples are presented as well.

Keywords

Spectral deconvolution, compound identification, compound quantitation, high mass resolution, gas chromatography, mass spectrometry, metabolomics, computational workflow, software, visualization

Introduction

The ADAP-GC workflow is designed to preprocess raw, untargeted gas chromatography mass spectrometry (GC/MS) metabolomics data. 1–3 It carries out a sequence of computational tasks that include construction of extracted ion chromatograms (EICs, also called XICs), detection of peaks from EICs, spectral deconvolution, and alignment of analytes across samples. Since the first version of the workflow, its algorithms have been continually improved in terms of the accuracy of extracting metabolite information and the sensitivity of detecting low-concentration compounds. In particular, the deconvolution algorithm has undergone two major updates and has demonstrated similar or better performance in comparison to existing software tools for identifying and quantifying co-eluting metabolites based on unit mass resolution GC/MS metabolomics data. 3

However, the spectral deconvolution algorithms introduced in the two major updates, ADAP-GC 2.0 and ADAP-GC 3.0, were written purely in the programming language R and are therefore limited in processing speed. 2,3 To overcome this limitation, we have replaced the spectral deconvolution algorithm with its analogue written in Java. Java not only speeds up the computations considerably, but also makes it easy to integrate ADAP-GC with the widely used software MZmine 2, which is also written in Java and has a modular software design that simplifies the integration. 4–6 This integration allows ADAP-GC users to take advantage of the strengths of MZmine 2, including rich visualization and the ability to preprocess raw data in multiple open data formats and export preprocessing results to multiple formats. As a result, ADAP-GC becomes a full-fledged and user-friendly graphical software tool. At the same time, ADAP-GC benefits MZmine 2 by equipping it with the ability to preprocess GC/MS data. Before ADAP-GC was integrated with MZmine 2, the latter was unable to perform spectral deconvolution of GC/MS data and, as a result, was primarily used for preprocessing LC/MS data.
In addition to the reimplementation of ADAP-GC algorithms in Java and integration with MZmine 2, a number of significant changes have been made to the ADAP-GC workflow. First, all algorithms in the previous version of ADAP-GC were designed to perform deconvolution of GC/MS data at unit mass resolution. With more GC/MS metabolomics data being acquired at high mass resolution, it has become necessary to update the relevant algorithms to handle such data. Toward this end, we have used two algorithms originally developed for constructing EICs and detecting EIC peaks from high mass resolution liquid chromatography (LC) MS data. 7,8 These algorithms are part of the ADAP software library developed in our laboratory and have demonstrated their ability to handle high mass resolution GC/MS data as well. Furthermore, we have improved the spectral deconvolution algorithm by reducing the number of computational steps and the associated user-defined parameters, which makes the algorithm less prone to deconvolution errors due to improperly set parameters.

The improved workflow is versioned ADAP-GC 3.2. Its deconvolution performance has been evaluated using both unit and high mass resolution data and compared to that of its predecessor ADAP-GC 3.0 as well as AMDIS, 9,10 AnalyzerPro, 11 and ChromaTOF. 12 The source code, the data files used for testing it, and the results it produced can all be accessed via http://www.du-lab.org.

Experimental Section

Unit mass resolution TOF GC/MS datasets. Two sets of unit mass resolution data files were used in the development of ADAP-GC 3.2. After TMS derivatization, each 1 µL aliquot of the derivatized solution was injected in splitless mode into an Agilent 6890N GC system (Santa Clara, CA, USA) that was coupled with a Pegasus HT TOF-MS (LECO Corporation, St. Joseph, MI, USA). Separation was achieved on a DB-5 ms capillary column (30 m × 250 µm I.D., 0.25 µm film thickness; Agilent J&W Scientific, Folsom, CA, USA), with helium as the carrier gas at a constant flow rate of 1.0 mL/min. The injection, transfer interface, and ion source temperatures were set to 260 °C, 260 °C, and 210 °C, respectively. The GC temperature program was 2 min of isothermal heating at 80 °C, followed by oven temperature ramps of 10 °C/min to 220 °C, 5 °C/min to 240 °C, and 25 °C/min to 290 °C, and a final 8 min hold at 290 °C. Electron impact ionization (70 eV) in full scan mode (m/z 40–600) was used, with an acquisition rate of 20 spectra per second in the TOF/MS setting.

(1) Mixture of standard compounds (Sample I): Seven calibration curve samples, each containing 27 standard compounds, were prepared at different concentrations (0.2, 0.4, 0.6, 0.8, 1.0, 2.0, and 5.0 µg/mL of each compound). With four pairs of co-eluting compounds in each sample, we were able to evaluate the overall performance of peak detection and deconvolution of ADAP-GC 3.2.

(2) Urine samples with standard mixtures spiked in (Sample II): Sample II was prepared by spiking a pooled urine sample with the seven calibration curve samples of Sample I and an additional sample consisting of 0.1 µg/mL of each standard compound. Sample II was used for evaluating the performance of ADAP-GC 3.2 in terms of processing complex samples.

High mass resolution Orbitrap GC/MS datasets. A series of 16 distinct mixtures containing a total of 260 standard compounds were analyzed for testing the detection and identification of environmental pollutants using gas chromatography with high-resolution accurate mass detection. These standard compounds include brominated flame retardants, dioxins, furans, polychlorinated biphenyls, organonitrogen pesticides, pyrethroids, organophosphorus pesticides, and organochlorine pesticides. Each mixture was prepared in isooctane, with concentrations ranging from 100 to 2000 ng/mL and containing up to 34 unique compounds. For each sample, a 3 µL aliquot was injected in splitless mode, and analyte separation was accomplished on a 15 m × 0.25 mm × 0.25 µm DB-5ms Ultra Inert column using a Trace 1310 GC (Thermo Scientific). Ultra-high purity helium was used as the carrier gas with the following temperature program: hold for 1 min at 100 °C, increase at 20 °C/min to 180 °C and hold for 1 min, increase at 4 °C/min to 250 °C and hold for 1 min, final increase to 300 °C at 20 °C/min and hold for 3 min (total run time 30 min). Accurate mass was detected by a Q-Exactive GC hybrid quadrupole-Orbitrap GC-MS/MS (Thermo Scientific) with an EI source operated in full scan mode (scan range of 85–850 m/z) at a resolution of 60,000 (FWHM). Data were collected and stored in profile mode, with a 3 min solvent delay at the beginning of the run excluded.
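As a quick arithmetic check of the two oven programs described above, the following sketch adds up hold times and ramp durations. The function name `program_minutes` and the tuple encoding of the steps are illustrative assumptions, not part of any instrument software; the temperatures, rates, and hold times are taken from the text.

```python
def program_minutes(start_temp_c, steps):
    """Total duration (min) of a GC oven temperature program.

    Each step is either ("hold", minutes) or ("ramp", rate_c_per_min, target_c).
    """
    total = 0.0
    temp = start_temp_c
    for step in steps:
        if step[0] == "hold":
            total += step[1]
        else:
            _, rate, target = step
            total += (target - temp) / rate  # time needed to reach the target
            temp = target
    return total


# TOF method: 2 min at 80 C, then three ramps, then 8 min at 290 C.
tof_program = [("hold", 2), ("ramp", 10, 220), ("ramp", 5, 240),
               ("ramp", 25, 290), ("hold", 8)]

# Orbitrap method: holds of 1, 1, 1, and 3 min separated by three ramps.
orbitrap_program = [("hold", 1), ("ramp", 20, 180), ("hold", 1),
                    ("ramp", 4, 250), ("hold", 1), ("ramp", 20, 300),
                    ("hold", 3)]

print(program_minutes(80, tof_program))        # 30.0
print(program_minutes(100, orbitrap_program))  # 30.0
```

The Orbitrap result reproduces the stated total run time of 30 min, and the same arithmetic gives a 30 min run for the TOF program.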

Results and Discussion


(i) each window contains the entirety of EIC peaks produced by the same analyte or by co-eluting analytes, and (ii) each window contains a much smaller number of EIC peaks than the entire data file. Subsequent deconvolution steps are carried out separately in each window so that the deconvolution algorithms are not overwhelmed by a large number of EIC peaks.

Deconvolution within each window starts with two sequential clustering phases applied to EIC peaks. The first-phase clustering is based on the proximity of peak apexes in the time domain, and each resulting cluster indicates the presence of at least one analyte. Since co-eluting analytes elute close to each other and can fall into the same cluster, a simple comparison of retention times cannot detect all co-eluting analytes. These analytes can instead be detected by using elution profiles. Toward this end, a second-phase clustering based on the elution profiles of EIC peaks is carried out to group the unique EIC peaks from each first-phase cluster. As a result, each first-phase cluster can be split into one, two, or more smaller clusters, and each resulting cluster indicates the presence of a single analyte. From each second-phase cluster, a model peak is selected that best represents the elution profile of the corresponding analyte. Since an observed EIC peak can be produced by two or more co-eluting analytes, the fragmentation spectra of the detected analytes are constructed by decomposing every observed EIC peak into a linear combination of model peaks. Details of these deconvolution steps can be found in the previous publications. 1–3

The deconvolution procedure described above has demonstrated good performance in terms of the accuracy of qualitative and quantitative compound information extracted from the data. 3 However, the procedure can run very slowly when deconvolution windows are wide.
Wide deconvolution windows tend to contain a large number of analytes, which causes the slowdown of the deconvolution procedure. Previous versions of the ADAP-GC workflow determine deconvolution windows using peak picking results from the total ion chromatogram (TIC). However, TIC may not reflect the presence of individual EIC peaks equally. This is because TIC is the summation of all EICs and the con-


parameters depending on the type of analyzed data. Moreover, they might need to find new appropriate parameters for TIC peak detection when they work with a type of data that the deconvolution procedure has not been tested on. As a better solution, we have eliminated the explicit step of determining deconvolution windows and achieve the same goal by clustering the retention times of all of the EIC peak apexes in the entire data file. Specifically, in the first-phase clustering, complete-linkage hierarchical clustering is replaced with the clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which partitions the retention times of EIC peaks into dense regions separated by sparse regions 13 (Figure 3A).

Compared to hierarchical clustering, DBSCAN has an advantage in computational efficiency. The runtime complexities of the hierarchical clustering and DBSCAN algorithms are O(n^2) and O(n log n), respectively, so DBSCAN is significantly faster for large n. As a specific example, we have clustered the retention times of all of the 39,843 EIC peaks from data file U50S5_1 using the Python package scikit-learn. 14 The hierarchical and DBSCAN clustering took 64.37 and 1.78 seconds, respectively, a roughly 36-fold difference. Therefore, DBSCAN can efficiently cluster the apex retention times of all of the EIC peaks spanning the entire retention time range in a typical metabolomics data file, whereas the same task would take much longer with hierarchical clustering. In addition to this advantage in computing efficiency, DBSCAN can better handle clusters of different sizes and is more robust to stand-alone EIC peaks, which are usually considered noise.

During the decomposition step, every EIC peak is decomposed into a linear combination of model peaks.
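To make the first-phase clustering concrete, the sketch below implements DBSCAN restricted to one dimension (apex retention times) in plain Python. It is an illustration only, not the ADAP-GC Java code and not the scikit-learn implementation used for the timing above. The function name `dbscan_1d`, the parameters `eps` and `min_pts`, their values, and the retention times are all hypothetical. Sorting the values first reduces every neighborhood query to two binary searches, which is one way to obtain the O(n log n) behavior mentioned in the text.

```python
from bisect import bisect_left, bisect_right

NOISE = -1


def dbscan_1d(times, eps, min_pts):
    """Cluster 1-D values with DBSCAN; returns one label per input value.

    Labels are 0, 1, 2, ... for clusters and NOISE (-1) for isolated values.
    """
    order = sorted(range(len(times)), key=times.__getitem__)
    xs = [times[i] for i in order]

    # Core point: at least min_pts values (itself included) within eps.
    is_core = [
        bisect_right(xs, x + eps) - bisect_left(xs, x - eps) >= min_pts
        for x in xs
    ]

    sorted_labels = [NOISE] * len(xs)
    cluster = -1
    last_core = None  # position of the most recent core point
    for k, x in enumerate(xs):
        if not is_core[k]:
            continue
        # Consecutive core points farther apart than eps start a new cluster.
        if last_core is None or x - xs[last_core] > eps:
            cluster += 1
        sorted_labels[k] = cluster
        last_core = k

    # Border points join the cluster of the nearest core point within eps
    # (a deterministic tie-break; classic DBSCAN is order-dependent here).
    core_pos = [k for k in range(len(xs)) if is_core[k]]
    core_xs = [xs[k] for k in core_pos]
    for k, x in enumerate(xs):
        if is_core[k] or not core_xs:
            continue
        j = bisect_left(core_xs, x)
        candidates = [c for c in (j - 1, j) if 0 <= c < len(core_xs)]
        nearest = min(candidates, key=lambda c: abs(core_xs[c] - x))
        if abs(core_xs[nearest] - x) <= eps:
            sorted_labels[k] = sorted_labels[core_pos[nearest]]

    # Map labels back to the original input order.
    labels = [NOISE] * len(times)
    for k, i in enumerate(order):
        labels[i] = sorted_labels[k]
    return labels


# Hypothetical apex retention times (min) of EIC peaks, in arbitrary order.
apex_times = [8.40, 5.16, 12.0, 8.74, 5.17, 8.41, 5.18, 8.73, 5.17, 8.40, 8.74]
labels = dbscan_1d(apex_times, eps=0.05, min_pts=3)
print(labels)  # [1, 0, -1, 2, 0, 1, 0, 2, 0, 1, 2]
```

In this toy input, the three dense groups of apexes become clusters 0 to 2, and the stand-alone apex at 12.0 min is labeled as noise, mirroring the robustness to isolated EIC peaks noted above.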
In order to improve the overall runtime of the deconvolution procedure, the new algorithm decomposes an EIC peak into a linear combination of only those model peaks that overlap with the EIC peak, thus keeping the number of model peaks participating in the decomposition low. This is in contrast to the implementations of the ADAP-GC 2.0 and 3.0 workflows, where all of the model peaks within the deconvolution window participate in the decomposition.

Figure 3 demonstrates the main steps of ADAP-GC 3.2 spectral deconvolution using a unit mass resolution dataset U50S0.2_1.cdf. First, dense clusters of peaks in the time domain are detected using DBSCAN. The result produced by DBSCAN is shown in Figure 3A, where each dot represents a peak apex with the retention time and m/z value as its coordinates. This is a quick but coarse clustering, and it may produce clusters containing EIC peaks that correspond to more than one analyte. For instance, all peaks in cluster (i) in Figure 3A correspond to one analyte, whereas cluster (ii) contains peaks produced by two analytes. Unique peaks in clusters (i) and (ii) are shown in Figure 3B, where different colors indicate the groups of EIC peaks determined by the second clustering based on the similarity of their elution profiles. All unique EIC peaks in cluster (i) are produced by the same analyte and stay in the same cluster after the second clustering. However, unique EIC peaks in cluster (ii) are split into two clusters, named (iii) and (iv). The model peaks in clusters (i), (iii), and (iv) are the unique EIC peaks of the highest sharpness. Figure 3C demonstrates the fragmentation spectra constructed for clusters (iii) and (iv) by decomposing every EIC peak into a linear combination of the model peaks 154 and 156. These spectra are matched against in-house library spectra (shown in red in Figure 3C) for L-Histidine and L-Lysine with scores of 803 and 965, respectively.
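The decomposition step, restricted to the model peaks that overlap the EIC peak as described above, can be sketched as a small least-squares problem. The code below is a minimal illustration under stated assumptions: all peaks are sampled on a common scan grid, overlap is defined as sharing at least one scan with non-zero intensity, and the fit is an ordinary least-squares solution of the normal equations with negative coefficients clipped to zero. The text does not specify the solver used in ADAP-GC, so these choices, the function names, and the intensity values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally long intensity vectors."""
    return sum(a * b for a, b in zip(u, v))


def overlaps(u, v):
    """True if two peaks have non-zero intensity in at least one common scan."""
    return any(a > 0 and b > 0 for a, b in zip(u, v))


def solve(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination.

    Assumes the system is non-singular (model peaks are not collinear).
    """
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def decompose(peak, model_peaks):
    """Express `peak` as a combination of the model peaks that overlap it."""
    used = {name: m for name, m in model_peaks.items() if overlaps(peak, m)}
    names = list(used)
    # Normal equations (M^T M) c = M^T y for the overlapping model peaks only.
    gram = [[dot(used[a], used[b]) for b in names] for a in names]
    rhs = [dot(used[a], peak) for a in names]
    coeffs = solve(gram, rhs)
    # Clipping is a simplification, not a true non-negative least squares.
    return {name: max(0.0, c) for name, c in zip(names, coeffs)}


# Hypothetical elution profiles on a common grid of 10 scans.
models = {
    "model_a": [0, 1, 4, 6, 4, 1, 0, 0, 0, 0],
    "model_b": [0, 0, 0, 1, 4, 6, 4, 1, 0, 0],
    "model_c": [0, 0, 0, 0, 0, 0, 0, 0, 3, 5],  # elutes later, no overlap
}
# Composite EIC peak built as 2.0 * model_a + 0.5 * model_b.
observed = [0, 2, 8, 12.5, 10, 5, 2, 0.5, 0, 0]

weights = decompose(observed, models)
print(weights)
```

Here `model_c` is excluded before the fit because it shares no scan with the observed peak, so only a 2 × 2 system is solved and the recovered weights are approximately 2.0 and 0.5. Each weight is the contribution of that EIC's m/z to the corresponding analyte's constructed spectrum.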

Results. First, we compare the performance of ADAP-GC 3.2 to its predecessor, ADAP-GC 3.0, on the unit mass resolution TOF GC/MS datasets. We choose the same datasets that were used by Yan Ni et al. 3 The first dataset consists of 7 samples of standard compounds at different concentrations, and the second dataset consists of 8 urine samples with the same standard compounds spiked in at different concentrations. In order to identify compounds, we use an in-house library of fragmentation mass spectra obtained for the standard compounds on the same GC/MS equipment. The matching score is calculated by the formula used by Yan Ni et al. 3 to be consistent with the identification results produced by ADAP-GC 3.0. Note that during spectral deconvolution, EIC peaks with m/z values 73, 147, and 221 were excluded from the list of model peak candidates and, therefore, could not be chosen as model peaks. These peaks are typically produced by derivatizing reagents and are therefore contained in the spectrum of every analyte. As a result, they are likely to be composite and cannot represent the elution profile of a single analyte.


Table 1: Identification results for unit mass resolution datasets. Columns marked "Std" refer to the standard mixture (7 samples); columns marked "Urine" refer to the urine samples (8 samples).

| No. | Compound Name | Std RT (min) | Std Mass | Std R² | Std Score | Std Count | Urine RT (min) | Urine Mass | Urine R²‡ | Urine Score | Urine Count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Pyruvic acid | 5.17 | 174 | 0.999 | 934 | 7 | 5.16 | 174 | 0.977 | 944 | 8 |
| 2 | Propanoic acid | 5.34 | 117 | 0.999 | 981 | 7 | 5.34 | 117 | 0.996 | 979 | 8 |
| 3 | β-Amino isobutyric acid | 7.47 | 102 | 0.999 | 931 | 7 | 7.48 | 102 | 0.878 | 900 | 8 |
| 4 | L-Norleucine | 8.40 | 158 | 0.999 | 901 | 7 | 8.40 | 158 | 0.998 | 837 | 8 |
| 5 | Alloisoleucine | 8.73 | 158 | 0.998 | 856 | 7 | 8.74 | 158 | 0.998 | 873 | 8 |
| 6 | Proline | 8.78 | 142 | 0.999 | 979 | 7 | 8.78 | 142 | 0.997 | 965 | 8 |
| 7 | Glyceric acid | 9.34 | 189 | 0.998 | 975 | 7 | 9.34 | 189 | 0.996 | 974 | 8 |
| 8 | Threonine | 10.31 | 57 | 0.997 | 975 | 7 | 10.31 | 57 | 0.991 | 979 | 8 |
| 9 | 5-oxoproline | 12.80 | 156 | 0.999 | 927 | 7 | 12.81 | 156 | 0.995 | 886 | 8 |
| 10 | L-Cysteine | 13.57 | 115 | 1.000 | 839 | 2 | 13.52 | 292 | 0.316 | 729 | 8 |
| 11 | Creatinine | 13.57 | 115 | 1.000 | 869 | 5 | 13.59 | 115 | 0.342 | 967 | 8 |
| 12 | Citrulline | 14.84 | 70 | 0.997 | 956 | 7 | 14.85 | 142 | 0.994 | 920 | 8 |
| 13 | d-Xylose | 15.93 | 103 | 0.999 | 947 | 7 | 15.94 | 103 | 0.993 | 850 | 8 |
| 14 | Asparagine | 16.14 | 116 | 0.997 | 744 | 6 | 16.16 | 116 | 0.964 | 774 | 7 |
| 13(2) | d-Xylose | 16.16 | 103 | 0.994 | 963 | 7 | 16.17 | 103 | 0.998 | 963 | 8 |
| 15 | 1,4-Butanediamine | 17.59 | 174 | 0.996 | 958 | 7 | 17.60 | 174 | 0.999 | 957 | 8 |
| 16 | Glycerolphosphate | 18.51 | 73 | 0.998 | 902 | 7 | 18.52 | 299 | 0.995 | 943 | 8 |
| 17 | Chlorophenylalanine | 18.95 | 218 | NA† | 956 | 7 | 18.96 | 218 | NA† | 893 | 8 |
| 18 | Citric acid | 19.81 | 273 | 0.998 | 947 | 7 | 19.85 | 273 | 0.838 | 986 | 8 |
| 19 | Isocitric acid | 19.87 | 245 | 0.997 | 902 | 7 | 19.89 | 245 | 0.976 | 891 | 8 |
| 20 | L-Histidine | 21.92 | 154 | 0.994 | 880 | 7 | 21.95 | 154 | 0.958 | 933 | 8 |
| 21 | L-Lysine | 21.96 | 174 | 0.994 | 958 | 7 | 21.97 | 174 | 0.992 | 965 | 8 |
| 22 | Mannitol | 22.61 | 103 | 0.996 | 947 | 7 | 22.63 | 103 | 0.957 | 947 | 8 |
| 23 | Gallic acid | 22.87 | 281 | 0.998 | 974 | 7 | 22.88 | 281 | 0.967 | 950 | 8 |
| 24 | N-Acetyl glucosamine methoxime | 25.97 | 87 | 1.000 | 903 | 7 | 25.96 | 87 | 0.991 | 912 | 8 |
| 25 | L-Tryptophan | 27.94 | 202 | 0.996 | 966 | 7 | 27.94 | 202 | 0.987 | 906 | 8 |
| 26 | Adenosine | 31.38 | 230 | 0.994 | 900 | 7 | 31.38 | 230 | 0.995 | 937 | 8 |
| 27 | Guanosine | 32.31 | 103 | 0.991 | 815 | 7 | 32.31 | 324 | 0.990 | 906 | 8 |
| | Average Value | | | 0.997 | 921 | | | | 0.929 | 917 | |

† Chlorophenylalanine is used as an internal standard for quantitation, so its R² values are not available.
‡ R² values for the urine samples are approximate since precise concentrations of the compounds could not be determined a priori.


Table 1 lists the identification results produced by ADAP-GC 3.2. Compared to the results reported by Yan Ni et al., 3 we see highly similar identification results: the average matching scores produced by ADAP-GC 3.0 and 3.2 are, respectively, 917 and 921 for the standard-mixture dataset and 903 and 917 for the urine dataset, while the average R² values produced by the two versions are, respectively, 0.998 and 0.997 for the first dataset and 0.926 and 0.929 for the second (see reference 3 for the ADAP-GC 3.0 results). Thus, we can confirm that the new ADAP-GC 3.2 workflow produces identification results as good as those obtained by its previous version. In Figure 4A-D, we plot the fragmentation mass spectra constructed by ADAP-GC 3.0 and 3.2. We choose two pairs of co-eluting compounds (Asparagine and d-Xylose, Citric Acid and Isocitric Acid) and demonstrate that the spectra constructed by ADAP-GC 3.2 are almost identical to the ones produced by ADAP-GC 3.0. More plots comparing fragmentation mass spectra constructed by ADAP-GC 3.0 and 3.2 can be found in Supporting Information.

Next, we summarize the identification results produced by ADAP-GC 3.2 and 3.0, AMDIS, AnalyzerPro, and ChromaTOF in Table 2. The results from ADAP-GC 3.0 and AMDIS were reported by Yan Ni et al. 3 and are listed here without changes since the version of AMDIS has not changed since the last ADAP-GC publication. 3 The results from AnalyzerPro and ChromaTOF were recalculated with the newest versions of these software tools; their parameters and the identification results for each compound can be found in Supporting Information. Although the identification results are slightly different from those reported by Yan Ni et al., 3 the overall performance of the compared tools is the same: ADAP-GC (both 3.0 and 3.2) and ChromaTOF produce similar results in terms of the number of identified compounds, their matching scores, and R² values, while AMDIS and AnalyzerPro tend to miss certain compounds.
In the standard mixture samples, the pair of co-eluting compounds L-Cysteine and Creatinine could not be analytically resolved by any of the software tools. However, the co-eluting compounds L-Histidine and L-Lysine were analytically resolved by ADAP-GC 3.0, ADAP-GC 3.2, and ChromaTOF, but could not be resolved by AMDIS and AnalyzerPro in the standard mixture (see Supporting Information).

Table 2: Comparison of the identification results (standard mixture: 196 compounds; urine: 224 compounds).

| Software | Standard Mixture: Identified | Standard Mixture: Score | Standard Mixture: R² | Urine: Identified | Urine: Score | Urine: R²† |
|---|---|---|---|---|---|---|
| ADAP-GC 3.2 | 188 | 921 | 0.997 | 223 | 917 | 0.929 |
| ADAP-GC 3.0 | 189 | 917 | 0.998 | 224 | 903 | 0.926 |
| AMDIS | 179 | 899 | 0.990 | 221 | 906 | 0.803 |
| AnalyzerPro | 172 | 869 | 0.996 | 217 | 884 | 0.914 |
| ChromaTOF | 188 | 902 | 0.996 | 220 | 906 | 0.919 |

† R² values for the urine samples are approximate since precise concentrations of the compounds could not be determined a priori.

In order to test the performance of ADAP-GC 3.2 on high mass resolution data, the workflow has been used for preprocessing 16 data files generated on a high-resolution, accurate-mass Orbitrap (ThermoFisher Scientific) coupled to a GC system. The workflow is applied to each data file to detect the presence of analytes and construct the fragmentation spectrum for each analyte. The resulting spectra are matched against the NIST14 EI mass spectral library 15 by using the NIST MS Search software that comes with the library. In addition, the program MS PepSearch, 16 the console counterpart of MS Search, is used so that the library search results can be saved to a text file for subsequent automatic parsing. In both programs, the simple similarity matching score without the reverse-search option is used. We consider a compound to be identified if its matching score exceeds 800. Note that we do not place any restrictions on the position of a compound in the matching hit list since many compounds in the dataset belong to the groups of polychlorinated biphenyl (PCB) congeners, polybrominated diphenyl ethers (PBDEs), and hexachlorocyclohexanes (BHC) that are difficult to distinguish based solely on their fragmentation mass spectra. Therefore, those compounds may appear low in the hit list among other compounds from the same group but still have a high matching score. For instance, the two compounds with fragmentation mass spectra plotted in Figures 4E-F have high matching scores but appear as #25 and #65 in the matching hit lists.

The identification results are presented in Table 3. We report a total of 250 compounds instead of the 260 contained in the mixtures since the other ten compounds are not in the NIST14 EI spectral library and, therefore, could not be identified regardless of what data preprocessing algorithms are used. Out of the 250 compounds, the ADAP-GC 3.2 workflow was able to successfully identify 206 with a matching score of 800 and above, while 36 other compounds have matching scores between 500 and 800 (see Figure 5). Finally, 8 compounds could not be found in the raw MS data due to either low concentrations or co-elution with other compounds. In the Supporting Information, we provide the parameters that ADAP-GC 3.2 used for preprocessing these data files, details about each compound that has been identified, and plots of the mass spectra constructed by ADAP-GC 3.2 and the corresponding mass spectra in the NIST14 EI spectral library.

Table 3: Identification results for high mass resolution datasets.

| Sample | Number of compounds | Number of identified compounds |
|---|---|---|
| PCB_Content_Eval_Mix1 | 6 | 6 |
| PCB_Content_Eval_Mix2 | 3 | 3 |
| PCB_congener_calibration_mix | 14 | 14 |
| PBB153 | 1 | 1 |
| PBDE_Tech_Mixes | 6 | 6 |
| Dioxins | 4 | 2 |
| Furans | 4 | 3 |
| Pest_mix_08 | 16 | 13 |
| Pest_mix_09 | 40 | 32 |
| Pest_mix_10 | 25 | 13 |
| Pest_mix_11 | 28 | 23 |
| Pest_mix_12 | 35 | 31 |
| Pest_mix_13 | 27 | 22 |
| Pest_mix_14 | 9 | 9 |
| Pest_mix_15 | 24 | 21 |
| Pest_mix_16 | 8 | 7 |
| Total | 250 | 206 |
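The totals in Table 3 and the breakdown given in the text (206 identified, 36 with scores between 500 and 800, 8 not found) can be cross-checked with a few lines of code. The per-sample counts below are copied from Table 3; the variable and function names are illustrative only.

```python
# (sample, number of compounds, number identified with score of 800 and above)
TABLE_3 = [
    ("PCB_Content_Eval_Mix1", 6, 6),
    ("PCB_Content_Eval_Mix2", 3, 3),
    ("PCB_congener_calibration_mix", 14, 14),
    ("PBB153", 1, 1),
    ("PBDE_Tech_Mixes", 6, 6),
    ("Dioxins", 4, 2),
    ("Furans", 4, 3),
    ("Pest_mix_08", 16, 13),
    ("Pest_mix_09", 40, 32),
    ("Pest_mix_10", 25, 13),
    ("Pest_mix_11", 28, 23),
    ("Pest_mix_12", 35, 31),
    ("Pest_mix_13", 27, 22),
    ("Pest_mix_14", 9, 9),
    ("Pest_mix_15", 24, 21),
    ("Pest_mix_16", 8, 7),
]


def summarize(rows):
    """Return (total compounds, total identified, identified fraction)."""
    total = sum(n for _, n, _ in rows)
    identified = sum(k for _, _, k in rows)
    return total, identified, identified / total


total, identified, fraction = summarize(TABLE_3)
print(total, identified, round(100 * fraction, 1))  # 250 206 82.4

# The remaining compounds match the two groups described in the text.
assert total - identified == 36 + 8
```

The per-sample rows add up to the reported totals, which corresponds to an overall identification rate of 82.4% at the score threshold of 800.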

MZmine 2 Interface to ADAP-GC 3.2. ADAP-GC 3.2 has been implemented in Java and incorporated into the framework of MZmine 2. Two modules — ADAP Chromatogram builder for constructing EICs and Wavelets (ADAP) for detecting EIC peaks — have been developed by Myers and incorporated into MZmine 2 earlier. 8 The module Spectral deconvolution is reported here. For


intensities divided by the apex intensity are less than parameter Min edge-to-height ratio (0.3 in our tests); (ii) the difference between the left and right boundary intensities divided by the apex intensity is less than parameter Min delta-to-height ratio (0.3 in our tests); (iii) its sharpness exceeds parameter Min sharpness. Here, the sharpness of an EIC peak is proportional to its intensity but also depends on the peak width and shape. Therefore, parameter Min sharpness needs to be determined empirically so that noisy, distorted, or chromatographically unresolved EIC peaks are not displayed in the bottom-right panel.

The next parameter, Shape-similarity tolerance, affects the second-phase clustering. Shape similarity between two EIC peaks is calculated as the angle between their intensity vectors and ranges from 0° to 90°. The parameter should be adjusted to separate co-eluting analytes as displayed on the bottom-right panel and is usually set between 20 and 30 degrees. If this parameter is too large, the algorithm will not be able to separate co-eluting compounds and some of them can be missed. If this parameter is too small, the algorithm can produce many false analytes that would affect the identification and quantitation of the correct analytes. The result of the second-phase clustering is displayed on the bottom-right panel (Figure 6), where each cluster is distinguished by its color.

After the second-phase clustering is performed, the best EIC peak in each cluster is chosen to represent the elution profile of an analyte. This choice is based on either the sharpness of the EIC peaks or their m/z values. EIC peaks with high sharpness typically have a very smooth, undistorted shape, while EIC peaks with high m/z values have a smaller chance of being produced by co-eluting analytes. In our tests, the choice of EIC peaks to represent the elution profiles of analytes is based on sharpness (as in ADAP-GC 3.0).
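The shape-similarity measure described above, the angle between the intensity vectors of two EIC peaks, can be sketched as follows. The sketch assumes both peaks are sampled on the same scans; the function names, the Gaussian test profiles, and the 25° tolerance (a value inside the 20 to 30 degree range mentioned in the text) are illustrative assumptions rather than ADAP-GC defaults.

```python
import math


def shape_similarity_angle(peak_a, peak_b):
    """Angle in degrees between two intensity vectors of equal length.

    For non-negative intensities the result lies between 0 and 90 degrees.
    """
    if len(peak_a) != len(peak_b):
        raise ValueError("peaks must be sampled on the same scans")
    dot = sum(a * b for a, b in zip(peak_a, peak_b))
    norm_a = math.sqrt(sum(a * a for a in peak_a))
    norm_b = math.sqrt(sum(b * b for b in peak_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a peak with zero intensity has no shape")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


def gaussian_peak(center, sigma, n_scans, height=1.0):
    """Hypothetical elution profile used only for this illustration."""
    return [height * math.exp(-0.5 * ((s - center) / sigma) ** 2)
            for s in range(n_scans)]


tolerance = 25.0  # degrees, assumed value of Shape-similarity tolerance

reference = gaussian_peak(center=10, sigma=2, n_scans=25)
same_shape = gaussian_peak(center=10, sigma=2, n_scans=25, height=3.0)
shifted = gaussian_peak(center=13, sigma=2, n_scans=25)

angle_same = shape_similarity_angle(reference, same_shape)
angle_shifted = shape_similarity_angle(reference, shifted)

print(angle_same <= tolerance)     # True: same profile, different height
print(angle_shifted <= tolerance)  # False: apex shifted by three scans
```

Because the angle ignores overall intensity, fragments of one analyte with very different abundances still group together, while a shifted profile from a co-eluting analyte exceeds the tolerance and is placed in a separate cluster.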
Another visualization module displays the mass spectra constructed by the deconvolution process in the context of the raw mass spectra. In the example shown in Figure 7, the raw spectrum is shown in blue and the constructed spectrum in green. The two spectra are displayed in head-to-tail mode, similar to the method adopted in the NIST MS Search program. All of the constructed spectra can be exported in either .msp or .mgf (Mascot Generic Format) format.
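For readers unfamiliar with the .msp format, the following Python sketch writes one constructed spectrum as a minimal record with a `Name:` field, a `Num Peaks:` field, and one m/z-intensity pair per line. The function name, the example analyte name, and the peak values are hypothetical; the exact set of fields written by ADAP-GC 3.2 is not specified here and may differ.

```python
def format_msp_record(name, peaks):
    """Format one spectrum as a minimal .msp record.

    `peaks` is a list of (mz, intensity) pairs. Records in an .msp file
    are separated by a blank line, so the result ends with one.
    """
    lines = [f"Name: {name}", f"Num Peaks: {len(peaks)}"]
    lines += [f"{mz:.4f} {intensity:.0f}" for mz, intensity in peaks]
    return "\n".join(lines) + "\n\n"


# Hypothetical constructed spectrum with two fragment ions.
record = format_msp_record("Analyte 1", [(73.0468, 999.0), (147.0656, 420.0)])
print(record, end="")
```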


Supporting Information Available

The following files are available free of charge on the ACS website at http://pubs.acs.org/: Supporting Information 1: Parameters for preprocessing unit mass resolution datasets; Supporting Information 2: Comparison of spectra produced by ADAP-GC 3.0 and 3.2; Supporting Information 3: AnalyzerPro and ChromaTOF parameters and compound identification results; Supporting Information 4: Parameters for preprocessing high mass resolution datasets; Supporting Information 5: Compound identification results from high mass resolution datasets.

Acknowledgement

The authors thank the National Science Foundation (NSF; Award 1262416), the National Institute of Environmental Health Sciences (NIEHS; P50ES026071, P30ES019116, U2CES026560), and the United States Environmental Protection Agency (EPA; 83615301) for funding this research and development. In addition, we thank Dr. Brian T. Cooper at the University of North Carolina at Charlotte for insightful discussions on using NIST PepSearch, Dr. Tomáš Pluskal at the Whitehead Institute for Biomedical Research for his substantial help with incorporating ADAP-GC 3.2 into MZmine 2, and Dr. Yan Ni at the University of Hawaii Cancer Center for help with running ADAP-GC 3.0.

References

(1) Jiang, W.; Qiu, Y.; Ni, Y.; Su, M.; Jia, W.; Du, X. An automated data analysis pipeline for GC-TOF-MS metabonomics studies. J Proteome Res 2010, 9, 5974–81.
(2) Ni, Y.; Qiu, Y.; Jiang, W.; Suttlemyre, K.; Su, M.; Zhang, W.; Jia, W.; Du, X. ADAP-GC 2.0: deconvolution of coeluting metabolites from GC/TOF-MS data for metabolomics studies. Anal Chem 2012, 84, 6619–29.
(3) Ni, Y.; Su, M.; Qiu, Y.; Jia, W.; Du, X. ADAP-GC 3.0: Improved Peak Detection and Deconvolution of Co-eluting Metabolites from GC/TOF-MS Data for Metabolomics Studies. Anal Chem 2016, 88, 8802–11.
(4) Katajamaa, M.; Miettinen, J.; Oresic, M. MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 2006, 22, 634–6.
(5) Pluskal, T.; Castillo, S.; Villar-Briones, A.; Oresic, M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 2010, 11, 395.
(6) MZmine 2. http://mzmine.github.io/, [Accessed March 20, 2017].
(7) ADAP 3.1. http://www.du-lab.org/software.html, [Accessed June 28, 2017].
(8) Myers, O. D.; Sumner, S. J.; Li, S.; Barnes, S.; Du, X. One Step Forward for Reducing False Positive and False Negative Compound Identifications from Mass Spectrometry Metabolomics Data: New Algorithms for Constructing Extracted Ion Chromatograms and Detecting Chromatographic Peaks. Anal Chem 2017, 89, 8696–8703.
(9) AMDIS. http://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:amdis, [Accessed June 28, 2017].
(10) Stein, S. E. An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. Journal of the American Society for Mass Spectrometry 1999, 10, 770–781.
(11) AnalyzerPro. https://www.spectralworks.com/analyzerpro.html, [Accessed October 24, 2017].
(12) ChromaTOF. https://www.leco.com/products/separation-science/software-accessories/chromatof-software, [Accessed October 24, 2017].
(13) Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. 1996; pp 226–231.
(14) SciKit-Learn. http://scikit-learn.org, [Accessed May 11, 2017].
(15) NIST Mass Spectral Library. http://www.sisweb.com/software/ms/nist.htm, [Accessed April 28, 2017].
(16) MS PepSearch. http://chemdata.nist.gov/dokuwiki/doku.php?id=peptidew:mspepsearch, [Accessed April 5, 2017].

