ARTICLE pubs.acs.org/ac

Data Analysis Tool for Comprehensive Two-Dimensional Gas Chromatography/Time-of-Flight Mass Spectrometry

Sandra Castillo, Ismo Mattila, Jarkko Miettinen, Matej Oresic, and Tuulia Hyötyläinen*

VTT Technical Research Centre of Finland, Espoo, FI-02044 VTT, Finland

ABSTRACT: Data processing and identification of unknown compounds in comprehensive two-dimensional gas chromatography combined with time-of-flight mass spectrometry (GC×GC/TOFMS) analysis is a major challenge, particularly when large sample sets are analyzed. Herein, we present a method for efficient treatment of large data sets produced by GC×GC/TOFMS implemented as a freely available open source software package, Guineu. To handle large data sets and to efficiently utilize all the features available in the vendor software (baseline correction, mass spectral deconvolution, peak picking, integration, library search, and signal-to-noise filtering), data preprocessed by instrument software are used as a starting point for further processing. Our software affords alignment of the data, normalization, data filtering, and utilization of retention indexes in the verification of identification as well as a novel tool for automated group-type identification of the compounds. Herein, different features of the software are studied in detail and the performance of the system is verified by the analysis of a large set of standard samples as well as of a large set of authentic biological samples, including the control samples. The quantitative features of our GC×GC/TOFMS methodology are also studied to further demonstrate the method performance and the experimental results confirm the reliability of the developed procedure. The methodology has already been successfully used for the analysis of several thousand samples in the field of metabolomics.

Received: December 21, 2010. Accepted: March 12, 2011. Published: March 24, 2011.
© 2011 American Chemical Society
Analytical Chemistry, dx.doi.org/10.1021/ac103308x | Anal. Chem. 2011, 83, 3058–3067

Comprehensive two-dimensional gas chromatography combined with time-of-flight mass spectrometry (GC×GC/TOFMS) is gaining acceptance as one of the most rapid and high-resolution systems available for the separation of (semi)volatile organic compounds.1 The GC×GC/TOFMS technique offers not only superior separation efficiency but also enhanced sensitivity due to concentrative modulation. A further feature of the GC×GC technique is the ordered structure of the chromatograms, which can be utilized in the identification of unknown compounds. During the past decade, this technique has proven its usefulness and reliability in various application areas, including the petroleum industry, flavors and fragrances, and environmental and food-related applications.2,3

The main challenge in GC×GC/TOFMS analysis is currently the data processing, particularly when large sample sets are analyzed and/or comparative analyses of, e.g., metabolite profiles across multiple samples are needed. Typically, only a relatively small part of the data is utilized because taking into account all the information obtained by the 3D separation is far from trivial and handling chromatograms with several hundred or even thousands of peaks is challenging. Utilization of automated data analysis tools combined with chemometric analysis allows more efficient data analysis than the conventional approaches. However, before any chemometric analysis can be performed, preprocessing tools must be applied to the raw 2D data to correct for signal fluctuations due to random error, instrumentation fluctuations, and detector noise. Typically, baseline subtraction, normalization, noise filtering, and retention time alignment are required. Although advances have been made in chemometric methodology, there is still a significant gap between the data collection and the data interpretation for 2D separations, i.e., in going from data to useful information.4

In any study that requires comparison of chemical profiles across a large set of samples, automated procedures that allow careful alignment of the data and flexible treatment of the data, such as filtration of the data with predetermined rules, are highly advantageous. Moreover, automated procedures that allow compound-type identification, i.e., whether the (unidentified) peak is an alkane, an aromatic, a carboxylic acid, or an amino acid, would be very useful in such studies.

For preprocessing of the data, several approaches have been suggested. Shellie and colleagues developed a methodology that combines chromatogram subtraction, averaging routines, weighting factors, and Student's t test to compare GC×GC profiles of a sample directly against a reference chromatogram, making use of the compare function in the ChromaTOF software by Leco Corp. (St. Joseph, MI).5 A similar approach was developed by Kallio et al.6 However, this methodology is not well suited for the comparison of a large number of samples because multiple comparisons with each sample serving as a reference have to be made. On the other hand, parallel factor analysis (PARAFAC)7 has been used for the deconvolution and quantification of overlapping peaks in higher order data following data reduction by Fisher ratio preprocessing.8,9 The Fisher ratio/PARAFAC algorithm first performs automated baseline correction similarly to ChromaTOF. However, contrary to ChromaTOF, statistical analysis is performed prior to application of the other functions, such as mass spectral deconvolution and integration. Additionally, unlike in ChromaTOF, where complete data are processed, only data subregions of interest are included in the further processing in the PARAFAC approach. Although novel approaches have been developed that allow relatively large subsets of the data to be processed with PARAFAC, the main challenge of using PARAFAC is the time required for the data treatment.10,11 Parallel computing methods have also recently been developed for the processing of GC×GC/TOFMS data.12 A recent study utilizing commercial software for the treatment of GC×GC/TOFMS data showed that the commercial method was very time-consuming and was feasible only for relatively small sample sets of up to 30–50 samples.13

Several strategies have been proposed for the alignment of GC×GC data. Most of the methods are based on similar procedures that were originally developed for one-dimensional chromatographic data. It should be noted, however, that in two-dimensional separations the alignment is more critical because of the inherently higher relative variability of the retention times in the very short second-dimension time window.
The alignment procedures developed for the 2D data include an algorithm based on the minimization of the pseudorank of a matrix formed by the juxtaposition of a reference chromatogram and a chromatogram to be aligned, developed by Fraga et al.,14 and a piecewise alignment of GC×GC chromatograms, developed by Pierce et al.15 More recently, Zhang and colleagues developed a methodology in which a piecewise linear correlation optimized warping algorithm was used for the alignment of GC×GC/MS data,16 and Suits et al. applied Warp2D for aligning liquid chromatography–mass spectrometry (LC/MS) data.17 A similar approach, combining dynamic time warping for alignment followed by principal component analysis or independent component analysis, has also been used for the analysis of GC×GC data.18

In addition to developments of data preprocessing methodologies, several automated peak-based classification algorithms have been developed for various types of applications. Application of pattern recognition in mass spectrometric data obtained with GC×GC/MS was introduced by Welthagen and colleagues.1 The classification methodology developed by this group includes several recognition patterns based on mass fragmentation patterns and relative ion yields as well as two-dimensional gas chromatographic retention times. With a set of predefined rules, sum parameters of alkanes, alkenes, cycloalkanes, alkane acids, alkyl-substituted benzenes, polar benzenes with or without alkyl groups, partly hydrated naphthalenes and alkyl-substituted benzenes, naphthalene, and alkyl-substituted naphthalenes can be classified. Further compound groups were recently added to the methodology.19 For the above-mentioned compounds, it is relatively easy to develop a classifier based on their spectra. However, in the case of more polar, derivatized compounds, the classification is not as straightforward. For example, trimethylsilylated polar compounds, such as carboxylic acids, amino acids, and sugars, have a much more complex spectral variety, and the retention times of the derivatized compounds in the two dimensions are not characteristic.20 Thus, simple classification does not work well for such compounds.

For GC/MS data, identification procedures rely on compound library comparison. However, using the spectral data alone for the identification is not sufficient, and retention index (RI) data should also be used to facilitate the identification. Currently, identification using both spectral and RI data is only possible by a time-consuming, manually supervised matching of both the RI information and the reference mass spectra stored in dedicated libraries such as the Golm Metabolome Database (GMD), which is particularly suitable for metabolomic studies.21 The GMD archives respective mass spectral tags (MSTs) to provisionally accommodate unidentified compounds. The MSTs are defined to represent the combination of chemophysical properties, namely, the mass fragmentation patterns linked to the chromatographic RI information.22 Therefore, the GMD may represent an ideal resource for the application of supervised machine learning algorithms for compound classification as a means for the automated annotation of MSTs. The GMD compendium may thus be used to enhance the chemical identification process of novel metabolic components discovered by GC/(TOF)MS based metabolomic screening studies.

Herein, we present a method for efficient treatment of large sample sets produced by the GC×GC/TOFMS system, implemented as a freely available open source software package. The methodology utilizes preprocessed data, and in principle any type of vendor software can be applied for this step. First, the GC×GC/TOFMS data are treated with the vendor software (baseline correction, mass spectral deconvolution, peak picking, integration, library search, and signal/noise filtering).
The data are then transferred in text format into the new software, named Guineu, for the final data processing, including alignment of the peaks, normalization, filtration procedures, calculation of the deviation from literature retention indexes, and group-type identification. The different features of the software are studied in detail, and the performance of the system is verified by analysis of a large set of standard samples as well as a large set of authentic samples, also including control samples. The quantitative features of the GC×GC/TOFMS methodology are also studied to further validate the methodology.

EXPERIMENTAL SECTION

Materials and Methods. 4-Coumaric acid, ferulic acid, and gallic acid were purchased from Extrasynthese (Genay, France). Benzoic acid (BA), 3-hydroxybenzoic acid, 3-(4-hydroxyphenyl)propionic acid, 3-(3,4-dihydroxyphenyl)propionic acid, 3,4-dihydroxytoluene, 3,4-dimethoxybenzoic acid, and 3-coumaric acid were products from Aldrich (Steinheim, Germany). 4-Hydroxybenzoic acid, 2-(3-hydroxyphenyl)acetic acid, 2-(3,4-dihydroxyphenyl)acetic acid, and hippuric acid were purchased from Sigma (St. Louis, MO). 3-Phenylpropionic acid, sinapic acid, 3,4-dihydroxybenzoic acid, and vanillic acid were from Fluka (Buchs, Switzerland). 3-(3-Hydroxyphenyl)propionic acid was purchased from Alfa Aesar (Karlsruhe, Germany). L-Alanine, L-aspartic acid, L-glutamic acid, glycine, L-isoleucine, L-leucine, L-lysine, L-methionine, L-phenylalanine, L-proline, L-serine, L-threonine, L-tyrosine, L-valine, and DL-valine-2,3,4,4,4,5,5,5-d8 were purchased from Sigma. Hexane, undecane, pentadecane, heptadecane, heneicosane, pentacosane, and heptadecanoic acid were products of Fluka (St. Louis, MO). O-Methylhydroxylamine hydrochloride (MOX) and N-methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA) from Pierce (Rockford, IL) were used as derivatization reagents.

Samples. The plasma samples (30 μL) were spiked with internal standards (10 μL of heptadecanoic acid, c = 200 μg/mL, and labeled DL-valine, 40 μg/mL), and the mixture was then extracted with 400 μL of methanol. After centrifugation the supernatant was evaporated to dryness, and the original metabolites were then converted into their TMS and methoxime (MEOX) derivative(s) by two-step derivatization. First, 25 μL of MOX reagent was added to the residue, and the mixture was incubated for 60 min at 45 °C. Next, 25 μL of MSTFA was added, and the mixture was incubated for 60 min at 45 °C. Finally, a retention index standard mixture (n-alkanes) and an injection standard (4,4′-dibromooctafluorobiphenyl), both in hexane, were added to the mixture. The purified fecal water was extracted twice with ethyl acetate. The ethyl acetate was evaporated to dryness, and the residue was derivatized with the same procedure as the plasma samples.

GC×GC/TOFMS Analysis. For the analysis, a Leco Pegasus 4D GC×GC/TOFMS instrument (Leco Corp., St. Joseph, MI) equipped with a cryogenic modulator was used. The GC part of the instrument was an Agilent 6890 gas chromatograph (Agilent Technologies, Palo Alto, CA) equipped with a split/splitless injector. The first-dimension chromatographic column was a 10 m RTX-5 capillary column with an internal diameter of 0.18 mm and a stationary-phase film thickness of 0.20 μm, and the second-dimension chromatographic column was a 1.5 m BPX-50 capillary column with an internal diameter of 100 μm and a film thickness of 0.1 μm. A methyl-deactivated retention gap (3 m × 0.53 mm i.d.) was used in front of the first column. High-purity helium was used as the carrier gas in a constant-pressure mode (39.6 psig). A 5 s separation time was used in the second dimension.
Electron impact ionization was applied, and the mass spectra were measured at 45–700 amu with 100 spectra/s. For the injection, a pulsed splitless injection (0.5 μL) at 240 °C was utilized, with a pulse pressure of 55 psig for 1 min. The temperature program was as follows: the first-dimension column oven ramp began at 40 °C with a 2 min hold, after which the temperature was programmed to 295 °C at a rate of 7 °C/min and then held at this value for 3 min. The second-dimension column temperature was maintained 20 °C higher than that of the corresponding first-dimension column. The programming rate and hold times were the same for the two columns.

Data Analysis. Automatic peak detection and mass spectrum deconvolution were performed using a peak width set to 0.2 s. Peaks with signal-to-noise (S/N) values lower than 10 were rejected. The S/N values were based on the masses chosen by the software for quantification. ChromaTOF version 4.22 was used for the raw data processing. The peak areas from the total ion chromatogram (TIC) were used for most of the compounds; for compounds that were quantified with the ChromaTOF software, peak areas of selected characteristic m/z were used.
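For orientation, the acquisition settings quoted above fix the size of the raw data per injection. The following is a minimal arithmetic sketch in Python, not part of the published method; it assumes that spectra are acquired over the entire oven program (the text does not state an acquisition delay), and all variable names are illustrative.

```python
# Acquisition settings quoted in the text above.
start_c, end_c = 40.0, 295.0        # oven start and end temperature, deg C
ramp_c_per_min = 7.0                # ramp rate, deg C/min
initial_hold_min = 2.0              # hold at 40 deg C
final_hold_min = 3.0                # hold at 295 deg C
modulation_s = 5.0                  # second-dimension separation time, s
spectra_per_s = 100                 # TOFMS acquisition rate

# Assumption (not stated in the text): acquisition spans the whole oven program.
ramp_min = (end_c - start_c) / ramp_c_per_min
run_min = initial_hold_min + ramp_min + final_hold_min
modulations = run_min * 60.0 / modulation_s
spectra_per_modulation = modulation_s * spectra_per_s
total_spectra = run_min * 60.0 * spectra_per_s

print(f"oven program: {run_min:.1f} min")
print(f"modulation cycles: {modulations:.0f}")
print(f"spectra per cycle: {spectra_per_modulation:.0f}")
print(f"spectra per run: {total_spectra:.0f}")
```

Under these assumptions one injection corresponds to an oven program of about 41 min, i.e., roughly 500 second-dimension chromatograms of 500 spectra each, which gives a sense of the data volume that has to be handled for large sample sets.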

RESULTS AND DISCUSSION

The main aim of this study was to develop a software tool for the data processing of GC×GC/TOFMS data. The suitability of the data analysis program was tested for several types of test samples as well as authentic sample sets. First, the GC×GC/TOFMS method was optimized for the profiling of serum, tissue, and fecal fermentation samples. The quantitative features, including repeatability, linearity, and limits of detection (LOD; S/N = 10) and quantification (LOQ; S/N = 100) of the method, were first studied. The features of the program were then studied in detail utilizing the experimental data.

Table 1. Quantitative Features for Selected Compounds(a)

compound                                m/z   LOD (ng)   R        RSD (%)
L-alanine                               116   0.5        0.9967   8.4
L-valine                                144   0.3        0.9999   1.5
L-leucine                               158   0.6        0.9983   3.7
L-isoleucine                            158   0.5        0.9987   2.8
L-proline                               142   4.0        0.9984   6.4
glycine                                 174   1.0        0.9947   13.3
L-serine                                204   10         0.9983   7.4
L-threonine                             219   10         0.9914   10.0
L-methionine                            176   20         0.9845   1.4
L-aspartic acid                         232   20         0.9932   1.5
L-phenylalanine                         218   2.0        0.9926   5.8
L-glutamic acid                         246   10         0.9872   9.5
L-ornithine                             142   1.0        0.9917   3.5
L-tyrosine                              218   1.0        0.9854   4.2
3-hydroxybutyric acid                   233   0.3        0.9961   5.0
palmitic acid                           313   0.1        0.9915   2.0
linoleic acid                           337   1.0        0.9992   4.3
oleic acid                              339   0.3        0.9981   2.7
stearic acid                            117   0.2        0.9961   4.8
arachidonic acid                        117   1.5        0.9993   2.9
cholesterol                             129   0.7        0.9986   2.4
benzoic acid                            179   1.5        0.9892   6.1
3,4-dihydroxytoluene                    268   0.2        0.9897   10.8
3-phenylpropionic acid                  104   1.0        0.9970   8.0
4-hydroxybenzoic acid                   267   1.5        0.9994   5.4
2-(3-hydroxyphenyl)acetic acid          164   0.6        0.9925   5.2
3-hydroxybenzoic acid                   223   0.8        0.9903   6.1
3,4-dimethoxybenzoic acid               195   2.0        0.9987   11.4
3-(3-hydroxyphenyl)propionic acid       205   1.0        0.9987   10.1
3-(4-hydroxyphenyl)propionic acid       179   0.7        0.9989   6.9
vanillic acid                           297   2.0        0.9973   7.2
3,4-dihydroxybenzoic acid               193   0.4        0.9990   8.8
2-(3,4-dihydroxyphenyl)acetic acid      179   0.2        0.9904   3.0
3-(3,4-dihydroxyphenyl)propionic acid   179   0.6        0.9984   11.9
ferulic acid                            338   7.0        0.9953   4.9

(a) RSD results are for serum and fecal samples (n = 14, for LOD, threshold for S/N = 10).

GC×GC/TOFMS Methodology for Quantitative Analysis of Metabolomic Samples. The GC×GC/TOFMS method developed for the study allows profiling of several types of biological samples and simultaneously allows quantitative analysis of several target amino acids, carboxylic acids, and phenolic acids. The sample preparation was straightforward, including nonselective extraction and two-step derivatization. The quantitative parameters of the method are given in Table 1. The results show that the method is linear in the tested range and the sensitivity of the method is good. The relative standard deviations (RSDs) shown in the table are calculated for the real samples, namely, serum samples and fecal fermentation samples. Also the long-term repeatability of the method is good. In a data set consisting of 440 serum samples acquired over a period of 20 days, the average RSD was under 10% for the internal standards. The day-to-day repeatability was studied with control serum samples over three months, and the RSD values for quantified compounds (amino acids and carboxylic acids) were also on average below 10%, ranging from 3% to 17%. As the GC×GC/TOFMS method is used for profiling purposes and the majority of the compounds are determined in a semiquantitative manner, the repeatability was also studied for the same control serum samples using peak areas, including all major peaks in the investigation (220 peaks). The average RSD for peak areas utilizing the TIC trace was below 24%, further demonstrating the ruggedness of the method.

Figure 1. Workflow for the data analysis.

Data Processing Software. The data processing methods were implemented in a stand-alone Java (http://www.java.com) application named Guineu. Guineu has a modular design which allows users with basic knowledge of the Java programming language to add or change algorithms easily. In addition, the Guineu software allows multitasking, therefore taking advantage of computers with several cores or processors. Guineu’s source code is published under the GNU General Public License and can be downloaded from https://code.google.com/p/guineu/, where the documentation of all its modules can be found. The workflow for the data treatment is shown in Figure 1. First, the raw GC×GC/TOFMS data are processed with the Leco ChromaTOF software, which is utilized for baseline correction, deconvolution, and peak picking and integration. The data are then transferred into the Guineu software in text format, with the data containing retention time data, the calculated RIs, spectral information, identification (best match), spectral similarity, and peak area and/or concentration. The Guineu features are summarized in Table 2.
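The repeatability results quoted earlier in this section are relative standard deviations (RSDs). As a brief illustration of that quantity, the sketch below computes an RSD in Python. It assumes the sample standard deviation (n − 1 denominator), which the text does not specify; the function name `rsd_percent` and the peak areas are hypothetical.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation in percent: 100 * sample SD / |mean|."""
    if len(values) < 2:
        raise ValueError("need at least two replicate values")
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; RSD is undefined")
    # stdev() uses the n - 1 denominator (assumption; not stated in the paper).
    return 100.0 * stdev(values) / abs(m)


# Hypothetical peak areas of one internal standard in five control injections.
areas = [1.02e6, 0.97e6, 1.05e6, 0.99e6, 1.01e6]
print(f"RSD = {rsd_percent(areas):.1f}%")
```

Applied per compound across the control samples, a calculation of this kind yields one RSD value per aligned peak, which can then be averaged as in the figures reported above.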

Alignment. First, the data files are aligned utilizing the two retention times, spectral information, and identification (“score alignment”), as summarized in Table 3. Here, it is possible to use either peak areas for all compounds or concentrations for those compounds that have been quantified with, e.g., ChromaTOF software and peak areas for all other compounds. The user first has to define a retention time window using the two retention times (RT1 and RT2) and the RI. The RI is based on the first-dimension retention time. For each peak in the first sample, the software opens a new thread that searches for the corresponding peaks in the rest of the samples, allowing the alignment of all peaks in parallel. A path is then constructed, one path comprising the peaks across all samples that correspond to one specific compound. Table 4 presents the algorithm for constructing a path. The average values of the RTs (RT1.mean, RT2.mean) and RI (RI.mean) of the peaks already in the path are used to search for the matching peak in a new sample. That is, for each path in parallel, the user-defined (RT1w, RT2w, RIw) window is used to get a group of candidate peaks from each sample that has not yet been aligned by searching the neighborhood around the average RTs and RI of the path (i.e., the peaks within the [RT1.mean − RT1w, RT1.mean + RT1w] × [RT2.mean − RT2w, RT2.mean + RT2w] × [RI.mean − RIw, RI.mean + RIw] window). Each new candidate peak is then scored on the basis of the sum of the differences between the retention times and retention index of the peak and the mean RTs and RI of the path. The user must also define a threshold of spectral similarity and decide whether the identification of the compound can be used to discard some of the candidate peaks. Therefore, a candidate peak may be discarded from the possible alignment if its spectral similarity is less than the threshold or if the identification of the peak conflicts with the path. The best matching candidate peak is added to the path.
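The path construction described above can be sketched as follows. This is a simplified, sequential Python illustration, not Guineu's actual Java implementation. The class and function names, the unweighted sum of absolute differences used as the score, the per-peak `similarity` value (taken here to be the match similarity exported by the vendor software), and the rule that names conflict only when both are known and different are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class Peak:
    rt1: float          # first-dimension retention time
    rt2: float          # second-dimension retention time
    ri: float           # retention index
    similarity: float   # spectral similarity (hypothetical scale)
    name: str = "unknown"


@dataclass
class Path:
    peaks: List[Peak] = field(default_factory=list)

    def means(self):
        """Mean RT1, RT2, and RI of the peaks already in the path."""
        return (mean(p.rt1 for p in self.peaks),
                mean(p.rt2 for p in self.peaks),
                mean(p.ri for p in self.peaks))

    def name(self):
        """First known identification in the path, else 'unknown'."""
        for p in self.peaks:
            if p.name != "unknown":
                return p.name
        return "unknown"


def best_match(path, sample, rt1w, rt2w, riw, min_similarity,
               use_names=True) -> Optional[Peak]:
    """Return the lowest-scoring candidate in `sample`, or None."""
    m1, m2, mri = path.means()
    best, best_score = None, float("inf")
    for peak in sample:
        d1, d2, dri = abs(peak.rt1 - m1), abs(peak.rt2 - m2), abs(peak.ri - mri)
        if d1 > rt1w or d2 > rt2w or dri > riw:
            continue                      # outside the user-defined window
        if peak.similarity < min_similarity:
            continue                      # spectral similarity too low
        known = "unknown" not in (peak.name, path.name())
        if use_names and known and peak.name != path.name():
            continue                      # identification conflicts with path
        score = d1 + d2 + dri             # assumed: unweighted sum of differences
        if score < best_score:
            best, best_score = peak, score
    return best


def align(samples, **params):
    """Seed one path per peak of the first sample and extend it sample by sample."""
    paths = []
    for seed in samples[0]:
        path = Path([seed])
        for sample in samples[1:]:
            match = best_match(path, sample, **params)
            if match is not None:
                path.peaks.append(match)
        paths.append(path)
    return paths


# Hypothetical data: three samples, one compound present in the first two.
samples = [
    [Peak(600.0, 2.0, 1500.0, 900, "alanine")],
    [Peak(601.0, 2.1, 1502.0, 880, "alanine"),
     Peak(600.5, 2.0, 1501.0, 500, "alanine")],   # rejected: low similarity
    [Peak(700.0, 3.0, 1700.0, 950, "valine")],    # rejected: outside window
]
paths = align(samples, rt1w=5.0, rt2w=0.5, riw=10.0, min_similarity=700)
print(len(paths), len(paths[0].peaks))
```

The sketch omits several aspects mentioned in the text: paths are extended one after another rather than in parallel threads, only peaks of the first sample seed a path, and nothing prevents one peak from being assigned to two paths.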


Table 2. Summary of the Method Features

name: score alignment
parameters: RT1, RT2, RI, spectra, name
comments: aligns data on the basis of given criteria

name: name filter
parameters: this module deletes all compounds listed by the user; the list of names must be in a text format file, and each name must be in a different row
comments: useful if samples contain typical interferences from, e.g., derivatization reagents, solvent, etc.

name: peak count filter
parameters: this module deletes all rows in a selected peak list that contains fewer peaks than the number set by the user
comments: removes peaks present in only a few samples

name: calculate deviations
parameters: is based on a custom-made list of retention indexes (literature or own library values)
comments: together with spectral match makes the identification more reliable

name: similarity filter
parameters: this algorithm filters all compounds that have less similarity than the user-defined similarity; the user can choose between two kinds of similarity (maximum and mean), and the filter can delete the compounds or rename them to “unknown”
comments: removes erroneous identification

name: singling filter
parameters: the algorithm chooses, between compounds with the same name, the one which contains the largest peak and also the peak nearest to the ideal on the basis of its similarity; these two characteristics can correspond to the same peak, and in that case the algorithm will leave only this peak and filter out the rest with the same name

name: group identification filter
parameters: links to the Golm database for subgroup-type identification

name identification filter

links to the Golm database for identification based on average spectra and RI

maximum allowed RI difference is 25; spectral

remove nonpolar compounds

removes compounds that do not have m/z 73 or those

in the analysis of polar, silylated compounds,

linear normalizer

this algorithm normalizes the intensities of the compounds using one or more standard compounds

recalculate intensities

if there are compounds in the data set with intensities calculated

threshold cannot be given for which the intensity is 800 and/or RI difference