An Automated Data Analysis Pipeline for GC−TOF ... - ACS Publications

Sep 9, 2010 - Department of Bioinformatics & Genomics, University of North Carolina at Charlotte, North Carolina Research. Campus, Kannapolis, North ...
0 downloads 0 Views 3MB Size
An Automated Data Analysis Pipeline for GC-TOF-MS Metabonomics Studies Wenxin Jiang,†,# Yunping Qiu,‡,# Yan Ni,† Mingming Su,§ Wei Jia,‡ and Xiuxia Du*,† Department of Bioinformatics & Genomics, University of North Carolina at Charlotte, North Carolina Research Campus, Kannapolis, North Carolina 28081, Department of Nutrition, University of North Carolina at Greensboro, North Carolina Research Campus, Kannapolis, North Carolina 28081, and David H. Murdock Research Institute, North Carolina Research Campus, Kannapolis, North Carolina 28081 Received July 25, 2010

Recent technological advances have made it possible to carry out high-throughput metabonomics studies using gas chromatography coupled with time-of-flight mass spectrometry. Large volumes of data are produced from these studies and there is a pressing need for algorithms that can efficiently process and analyze data in a high-throughput fashion as well. We present an Automated Data Analysis Pipeline (ADAP) that has been developed for this purpose. ADAP consists of peak detection, deconvolution, peak alignment, and library search. It allows data to flow seamlessly through the analysis steps without any human intervention and features two novel algorithms in the analysis. Specifically, clustering is successfully applied in deconvolution to resolve coeluting compounds that are very common in complex samples and a two-phase alignment process has been implemented to enhance alignment accuracy. ADAP is written in standard C++ and R and uses parallel computing via Message Passing Interface for fast peak detection and deconvolution. ADAP has been applied to analyze both mixed standards samples and serum samples and identified and quantified metabolites successfully. ADAP is available at http://www.du-lab.org. Keywords: gas chromatography-mass spectrometry • deconvolution • extracted ion chromatogram • clustering analysis • alignment

Introduction Metabonomics/Metabolomics has emerged as an integral part of systems biology and functional genomics. It attempts to identify and quantify metabolites that are synthesized by a biological system1,2 and has found wide applications in drug discovery, disease diagnosis,3 plant biotechnology,4 nutrition science,5 and toxicology studies6 by profiling and fingerprinting metabolites in biofluids and tissues. Common analytical platforms that are used in metabonomics research include NMR spectroscopy, gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), and capillary electrophoresis-mass spectrometry (CE-MS).7 The high sensitivity of modern mass spectrometers has made MS-based metabonomics the method of choice for studies that involve the identification and quantitation of lowconcentration metabolites from complex samples. For MS-based metabonomics, LC-MS and GC-MS are complementary technologies with the latter being especially well suited for metabolites that are volatile and do not ionize well by LC-MS techniques. In addition, GC-MS provides high * To whom correspondence should be addressed: Xiuxia Du, 500 Laureate Way, Suite 2350, Kannapolis, NC 28081, Phone: (704) 250-5754, Fax: (704) 250-5759, Email: [email protected]. † University of North Carolina at Charlotte. # These authors contributed equally. ‡ University of North Carolina at Greensboro. § David H. Murdock Research Institute.

5974 Journal of Proteome Research 2010, 9, 5974–5981 Published on Web 09/09/2010

chromatographic resolution and permits separation of structurally similar compounds that would be difficult to separate by LC. Primary mass analyzers that are coupled to GC separations are quadrupole and TOF instruments. TOF-MS allows fast spectra acquisition, which facilitates deconvolution of coeluting metabolites.8 Large volumes of many complex data sets can be produced from GC-TOF-MS platforms for systems biology or functional genomics studies. A number of software applications have been developed to analyze GC-MS data. Representative software include MetAlign,9 MathDAMP,10 MetaQuant,11 MET-IDEA,12 MCR (i.e., multivariate curve resolution),13-15 AMDIS,16 TagFinder,17 and ChromaTOF, among others. Despite their successful applications, these data analysis platforms have limited capacity to process data sets in a high-throughput fashion and cause data analysis to be very lengthy. Additionally, these existing software tools suffer from a range of other limitations. For example, AMDIS does not perform alignment. MCR is highly sensitive to the selection and number of coanalyzed chromatogram files.17 MetAlign appears not to perform deconvolution.9,17 ChromaTOF (LECO, St. Joseph, MI) was developed exclusively for the GC-EI-TOF-MS and GCxGC-EITOF-MS instruments of the vendor. Consequently, the source code is not available, which makes it very difficult for customized applications. 10.1021/pr1007703

 2010 American Chemical Society

research articles

GC-MS Metabonomics Data Analysis Pipeline To enable large-scale, GC-TOF-MS-based metabonomics studies, a high-throughput data analysis platform is needed. In this paper, we report such a platform named Automated Data Analysis Pipeline (ADAP) that features peak detection, deconvolution, alignment, and library search and allows data to flow seamlessly through these analysis steps without any human intervention. Additionally, two novel algorithms have been designed for deconvolution and alignment. The first one applies clustering analyses to deconvolute coeluting compounds that are very common in complex samples. The second one minimizes the likelihood that different compounds are incorrectly aligned with each other by using a two-phase approach that is based on extracted ion chromatograms (EIC) rather than the total ion chromatogram (TIC). ADAP is written in C++ and R and the computationally intensive peak picking and deconvolution take advantage of parallel computing via Message Passing Interface (MPI). As a result, the power of multiple computing cores that are common in modern computers can be fully utilized for fast data analysis. To our knowledge, none of the aforementioned software tools utilizes parallel computing. ADAP has been applied to analyze both mixed standards samples and serum samples and successfully identified and quantified metabolites. ADAP is available at http://www.du-lab.org.

Experimental Procedures Sample Preparation. Three different types of samples were analyzed on a GC-TOF-MS platform for the development, evaluation, and validation of ADAP. (1) Mixed Standards (MS) Samples. A total of 38 standard compounds (Supplementary Table S1) were carefully selected and mixed together with known ratios. Criteria for selecting those compounds are the following. (i) They should contain different classes of compounds that include amino acids, organic acids, fatty acids, polyamines, and ketones; (ii) they should be common in human urine or blood samples; and (iii) the retention times of compounds are spaced across the entire 30-min time range. (2) Calibration Curve (CC) Samples. Ten calibration curve samples were prepared at different dilutions from the original mixture of 20 fatty acid standards (Table 1). These fatty acids vary in biochemical properties with different number of carbons and double bonds, or different double bond positions. (3) Liver Injury (LI) Samples. Serum samples were collected from male Sprague-Dawley rats. Ten rats had acute liver injury and ten served as healthy control. The liver injury was induced by two intraperitoneal injections of CCl4 solution as detailed in our previous publication.18 GC-TOF-MS Analysis. All the standard mixtures and serum samples were prepared, derivatized, and analyzed following previously published protocols.19,20 In brief, after TMS derivatization, each 1 µL aliquot of the derivatized solution was injected in splitless mode into an Agilent 6890N gas chromatography system (Santa Clara, CA) that was coupled with a Pegasus HT time-of-flight mass spectrometer (Leco Corporation, St. Joseph, MI). The spectral acquisition rate was 20 scans/ s. In total, 15, 11, and 20 sample data sets were generated for experiments (1), (2), and (3), respectively. In the remaining part of the paper, we use “a sample” to refer to one raw data file that is produced from the mass spectrometer.

Table 1. Performance Comparison of the One-Phase vs Two-Phase Alignment with the CC Samples one-phase alignment RT (min)

11.065 13.443 14.559 16.626 17.481 17.74 18.852 19.691 19.768 19.799 19.848 20.076 20.464 21.254 21.298 21.567 21.601 21.773 22.504

compound name

SmpNra

Decanoic acid Laurate Tridecanoic acid Pentadecanoic acid Palmitoleic acid Palmitic acid Heptadecanoic acid Linolic acid Linolenic acid Oleic acid Elaidic acid Stearic acid 8.11.14-Eicosatrienoic acid Arachidonic acid Eicosapentaenoic acid 11.14-Eicosadienoic acid 11-Eicosenoic acid Arachidic acid Decosahexaenoic acid

Average

two-phase alignment

score

SmpNr

11 11 11 11

887 855 883 690

11 11 11 11

score

893 863 890 836

11 3 2

NA 605 744

11 11 11

880 819 876

10 6 NAb 3 11 NA

479 NA NA 769 749 NA

11 7 11 3 11 11

798 676 494 795 914 840

9

750

11

929

2

651

11

812

11

855

11

882

10 6 11

717 NA 813

11 11 11

880 NA 837

8

746

10

829

a

SmpNr refers to the total number of samples in which a compound was observed. b NA means that the corresponding metabolite was not detected with high confidence because the matching score is below the cutoff value that was set at 750.

Figure 1. Data analysis workflow.

Data Analysis Methods and Results Data Analysis Workflow. The data analysis workflow consists of four sequential steps (Figure 1): peak detection, deconvolution, alignment, and library search. Peak detection finds local peaks in each EIC. Deconvolution separates coeluting components based on both the apex elution time and elution profiles of their fragment ions. Each component is potentially a compound if its identity can be determined. We use component here to be consistent with the naming rule used in AMDIS.16 After deconvolution, a mass spectrum is constructed for each component that is detected in one sample. Alignment aligns the same component across all of the samples in which they were observed based on their mass spectrum similarity and retention time similarity. The mass spectrum that is selected at the alignment stage as the optimal reference is used for compound identification. Lastly, the intensity of each component is determined for quantification. Next, we describe the details of each step. Peak Detection. The first step of the workflow is to detect EIC peaks for each mass. A number of algorithms have been developed and a thorough examination and comparison of these existing algorithms can be found in the review by Yang et al.21 These algorithms usually perform a denoising step that includes chromatogram smoothing and/or baseline correction Journal of Proteome Research • Vol. 9, No. 11, 2010 5975

research articles

Jiang et al.

Figure 2. Peak detection in an EIC. Different extracted ion chromatograms are denoted by different colors with their peak apexes marked as small circles and their masses shown on the far right. The inset is a zoomed-in depiction of a small segment of the EIC and shows the EIC peaks that have been detected.

before peak detection. ADAP performs peak detection before denoising so that all of the EIC peaks can be extracted. This prevents the removal of true EIC peaks from happening that can be caused by imperfection in the denoising algorithm, and ultimately benefits identification of compounds in terms of both confidence and total number of identifications (Supplementary Figure S1). The rationale behind this is threefold. First, the observation of a larger number of fragments that belong to the same compound increases the likelihood that the identification is correct. Second, observing fragments of large mass has a positive impact on compound identifications because a larger mass is usually given a heavier weight in library search.16 Since fragments of large masses tend to produce low intensity peaks in a spectrum, measures that are taken to preserve these peaks will facilitate identifications. Lastly, many compounds of interest such as biomarkers in biological studies are in the low concentration range. Because of these reasons, we chose to preserve as much information as possible at each stage of the data processing. Each EIC peak is characterized by its apex elution time, left and right boundary, peak height, and shape. Peak detection consists of two sequential steps: peak picking and peak filtering. Peak picking is accomplished by first searching for the apex within a time window (Figure 2). For the experimental data described in this paper, this window spans nine scans which translates to 9/20 s. In the ADAP software tool, the window width is a parameter that is specified by users based on the characteristics of their data. If the window were too narrow, the peak picking process would be very susceptible to noise. On the other hand, if the window is too wide, true apexes can be missed. Following the apex detection, the corresponding left and right boundaries of each EIC peak are determined, and peak height and shape are recorded. After all of the peaks in an EIC have been characterized, peak filtering is performed to remove peaks that most likely have resulted from noise. Specifically, the EIC is divided into equallength time windows. Within each window, a window-specific threshold is calculated as the product of the lowest peak intensity and a preset factor (that is like a signal-to-noise ratio). Any other peaks with intensity below the threshold are filtered out. 5976

Journal of Proteome Research • Vol. 9, No. 11, 2010

After peak detection has been performed for all the EICs, information in the raw data on the identity and concentration of metabolites has been extracted and represented as EIC peaks. The next step is to group these peaks so that fragment ions that correspond to the same component fall in the same group. Deconvolution. When compounds in a sample are fully resolved by GC, their mass spectra can be easily constructed by simply assigning peak intensity values obtained above to the corresponding fragment mass. However, when two or multiple compounds elute from the GC system in close proximity, peaks of fragments ions from different compounds will overlap and deconvolution has to be performed in order to construct their mass spectra. Traditional deconvolution was based on the assumption that fragment ions with similar apex elution time belong to the same component.22,23 However, this assumption will not hold when the apex elution time of different components are indistinguishable. This scenario is not uncommon in complex samples and Figure 3A depicts one such scenario where EICs of fragment ions from two components have nearly the same apex elution time. A closer examination of the EIC peaks reveals that a distinguishing feature of the coeluting components lies in the shape of the EIC peaks (or profiles). This shape difference can be captured by the normalized dot product. Specifically, let the abundance of the EIC profiles of two peaks be represented as two vectors: b e x ) {a1, a2, ..., an} b e y ) {b1, b2, ..., bn} where ai and bi are the abundance values at retention timeti. The similarity between b ex and b ey can be measured by the normalized dot product:

r)

ey b ex · b |b e x| · |b e y|

where the center dot ( · ) represents the dot product.

GC-MS Metabonomics Data Analysis Pipeline

research articles

Figure 3. Illustration of deconvolution and identification of components. (A) Unresolved TIC peak and EIC peaks. The apex elution time of these EIC peaks is hardly distinguishable. (B) Optimal clustering of the EIC peaks into two groups that are shown in red and blue, respectively. (C) Separation of EIC peaks based on shapes of elution profiles. (D) Construction of mass spectra from the separated EIC peaks. (E) Identification of compounds by searching the spectra obtained in (D) against the NIST reference library. The two coeluting compounds are found to be trans-cinnamic acid and creatinine.

To separate coeluting components, the r values of all pairs of EIC peaks that are in a narrow deconvolution time window (10 scans in this study) are calculated and a similarity matrix is formed for this window. Compounds that elute within this window are considered coeluting and thus indistinguishable based on their apex elution time only. Subsequently, k-medoids clustering is applied on this matrix to cluster the fragment ions. Figure 3B depicts the clustering results for one time window within which two compounds coelute. The fragment ions of creatinine (shown in red and grouped in cluster 2) are well separated from those of trans-cinnamic acid (shown in blue

and grouped in cluster 1). The two resulting spectra are matched with creatinine and trans-cinnamic acid in the NIST library with high matching scores, which demonstrates that ADAP can effectively detect differences in the shape of EIC peaks and successfully deconvolute coeluting components (Figure 3C-E). This deconvolution process is repeated for all deconvolution windows. The k-mediods clustering requires an initial assignment of k, the number of clusters. However, k is unknown prior to deconvolution. To resolve this issue, the silhouette score24 is Journal of Proteome Research • Vol. 9, No. 11, 2010 5977

research articles used in this study to assess the clustering quality and determine the k value.

S)

dinter - dintra max(dinter, dintra)

where S is the silhouette score of a cluster, dintra is the intracluster distance and is calculated as the average pairwise distance between objects within the cluster, dinter is the intercluster distance and is the minimum average distance from all of the other clusters to this cluster. The clustering is performed for different values of k and the k that results in the largest silhouette score is ultimately selected. To avoid falsely splitting fragments from the same component into two or more groups, an intracluster distance threshold is specified. When dintra is smaller than this threshold, the corresponding k is accepted and the search stops. After deconvolution, each component that is detected in one sample is associated with its sample-specific mass spectrum and apex elution time. Because of differences in experimental conditions such as temperature and column conditions, the apex elution time that is observed for the same compound is usually shifted differently across samples and, as a result, alignment is needed to correct this shift. Alignment. A number of algorithms have been developed for chromatogram alignment.25 Most of them were designed primarily for aligning TIC. However, TIC-based alignment can be inaccurate when different compounds within the same TIC peak shift differently along the retention time axis. A better approach is to do component-based alignment. Specifically, the same components across samples are identified based on their spectrum similarity, component-specific time shifts are determined, and ultimately, components are aligned accordingly. To search for the same component across samples, a measure of confidence needs to be defined that takes into account both the spectra similarity and retention time similarity between two components. In ADAP, this measure is: scoretotal(si, sj) ) 0.9scorespec(si, sj) + 0.1scoreRT(si, sj) where si and sj denote two spectra. Scorespec is adopted from the spectra similarity measure used in AMDIS and is a linear, weighted combination of the pure and impure score.23 ScoreRT is calculated by: scoreRT ) 1 - |∆RT|/w where ∆RT is the difference in their apex retention time and w is the maximum retention time shift that is acceptable for the same component in a particular experiment. Components whose retention time difference exceeds w should be considered as different components. The incorporation of scoreRT into scoretotal facilitates distinguishing components that have similar spectra and whose apex elution time difference is less than w, particularly in the case of isomers. This approach that uses a combination of mass measurement and elution time information has been used in the proteomics field to identify and quantify peptides.26,27 The final scoretotal is scaled so that it is between 0 and 999. This is the same numerical scale that the NIST library search uses. 5978

Journal of Proteome Research • Vol. 9, No. 11, 2010

Jiang et al. The alignment process starts with the earliest component and sequentially aligns every component for which a spectrum has been constructed. What lies at the core of aligning a component in ADAP is the accomplishment of two tasks: (1) identification of mass spectra that correspond to the same component across samples, and (2) selection of the best representative spectrum for the component. These two tasks are accomplished via a two-phase searching algorithm as follows: Phase 1. Among all the samples, identify the earliest component that has not been aligned and define a global alignment window that starts with this component and is of width w. Within this alignment window, use the spectrum of the earliest component as a reference and search for components in other samples that produce scoretotal greater than a certain threshold (750 in this study). From these high-scoring components, select the one that produces the highest scoretotal for each sample. Phase 2. Use each sample-specific best-matching component as a reference and repeat the searching process in Phase 1 to find the best matching components in other samples. As a result, a group of spectra are identified for each reference. For each component, the best representative spectrum across all the samples is determined by selecting the component that produces the highest average scoretotal when it is used as a reference in Phase 2. This component will be used as the final reference to align spectra across samples. The rationale behind this is as follows. Each mass spectrum that is constructed for a component consists of two parts: the pure part that corresponds to the component itself and the impure part due to experimental noise and/or interference from coeluting components. Since the spectrum that primarily consists of the pure part should be the most reproducible and, consequently, give rise to the highest average scoretotal, it is apparently the best representation. Ultimately, alignment is carried out with the best overall spectrum serving as the reference. ADAP requires that only these components that are observed in a sufficient number of samples be aligned. Phase 2 is a refinement of Phase 1 in that spectra identified in Phase 1 may not be the best representation of the component in the corresponding samples. This can happen when the earliest spectrum that is used in Phase 1 is not the best representation (examples provided in Supplementary Figure S2). Therefore, the two-phase approach should improve the alignment performance compared to a one-phase approach where only Phase 1 is used. Specifically, more samples can be aligned and a better representative spectrum can be found for compound identification. Table 1 demonstrates these improvements by analyzing the CC samples. Figure 4 illustrates the necessity of alignment and compares the EICs before and after alignment. Deviations of retention time for the 19 standard compounds are depicted in Figure 4K. This deviation profile is consistent with the temperature change in the experimental process, that is, the maximum deviation occurs at the time when the temperature reaches its peak (approximately 20 min in this example). Compound Identification. The identity of a component is determined by searching the corresponding mass spectrum against a library of spectra. ADAP is equipped with the capability to perform library search. Since the alignment references that have been found are the best representation of components, ADAP searches them against an in-house library that includes spectra for 500 human metabolites. In addition, ADAP provides the option to export spectra in the NIST

GC-MS Metabonomics Data Analysis Pipeline

research articles

Figure 4. Necessity of alignment and comparison of EICs before and after alignment. The 15 MS samples are considered and each sample is represented by one unique color. (A-C) TICs within three different time intervals. (A) One component elutes with two distinct TIC peaks; (B) two components elute with two distinct TIC peaks; (C) two components elute with two peaks that are barely distinguishable; (D) TICs of the 15 MS samples. (E-G) EICs before alignment; (H-J) EICs after alignment. EIC pairs E-H, F-I, and G-J correspond to TIC segments (A), (B), and (C), respectively. For the two EIC pairs (F and I) and (G and J), two coeluting components became distinguishable only after alignment. (K) Deviation of the elution time of 19 compounds in the MS samples with respect to the elution time of the alignment reference.

standard MSP format that can be read by NIST MS searching software28 or other third-party software to perform library search. The spectra files in the Supporting Information provide the mass spectra that were exported by ADAP in the standard NIST format. The high matching scores (searched against the NIST library) of the compounds listed in Table 1 demonstrate that ADAP has successfully identified these compounds from the CC samples. Compound Quantitation. Among all the fragment ions that have been detected for a component, the most abundant, unique ion is invaluable for comparing the concentrations of the same compounds across samples. A unique ion is defined to be an ion that is unique to a component within a certain time range and multiple unique ions can exist for one component. Since peak area and peak height are directly related to the concentration of compounds in a sample, either of them can be used for quantification. In the current version of ADAP, peak height is used. Supplementary Figure S3 depicts the abundance values of 19 compounds versus the true concentra-

tion of the compounds in the 11 CC samples. Both of the r-squared values are close to 1, which indicates the accuracy of ADAP in extracting quantitative information of compounds. Performance. Table 2 lists performance measures of ADAP after it is applied to analyze the three sets of samples. These measures include the total number of peaks that are detected, total number of components after deconvolution, and the total number of components after alignment with the requirement that each component is observed in more than 20% of the total number of samples. Among those components that are aligned, most of them are correctly identified with the average score higher than 800. ADAP is truly a high-throughput pipeline in that: (1) it is fully automated and no human intervention is needed in the entire process; (2) the computationally intensive deconvolution and alignment are written in C++ and applies parallel computing using MPI, and (3) special care has been taken to accelerate computations by optimizing memory usage and data structure. Table 2 lists the number of samples and the corresponding Journal of Proteome Research • Vol. 9, No. 11, 2010 5979

research articles

Jiang et al. 29

Table 2. Measures of ADAP Performance MS

Total number of samples 15 Average sample data size (MB) 123 Average number of peaks 125686 detected per sample Average number of components 405 detected per sample Total number of components 478 after alignment Processing time (s) Peak picking + 75 Deconvolution Alignment 10 Total 85

CC

LI

11 67 562018

20 118 281913

596

388

304

277

62

103

25 97

29 132

processing time for the three sets of data. We analyzed 15 MS samples and 20 liver injury samples with ADAP and HDA on the same workstation (2× Intel Quad-core Xeon CPU X5570 2.93 GHz, 24 GB memory). ADAP used less than 3 min to analyze each entire set of samples, whereas HDA used 5 h for the same analysis job. To our knowledge, ADAP is much faster compared to other existing software tools as well including ChromaTOF. The average matching scores of common metabolites identified from ADAP and HDA are very similar, which indicates that the two pipelines have comparable performance in terms of identifying compounds. Biological Application. ADAP has been applied to the metabolomics analysis in an animal experiment of liver injury. From 20 rat serum samples, a total of 277 components were produced after peak detection, deconvolution, and alignment. The concentration of each component in a sample was estimated as the total sum of the intensity values of all of the constituent fragment ions that have been identified. The resultant quantitation data for all of the components from the 20 samples were imported into the SIMCA-P 12.0 software package (Umetrics, Umeå, Sweden) for multivariate statistical

analysis. Specifically, mean-centering and autoscaling were used for data pretreatment.30 Subsequently, PCA (Principal Component Analysis) was applied and a clear separation between the diseased and control groups (Figure 5A) was observed with the first two components explaining 29.8% of the total variance. Lastly, a supervised PLS-DA model (Partial Least Squares Discriminant Analysis) was constructed (Figure 5B) to identify the deferential metabolites that contribute to the separation between two groups. A total of 55 significant components were selected using VIP statistics (VIP g 1, variable importance in the projection) and Pearson correlation coefficients (|Corr(t, X)| g 0.45) of the cross-validated PLS-DA model.31 The cutoff value of correlation coefficients was used to select the variables that were most correlated with the PLSDA discriminant scores (PC1) at a significant univariate level of 0.05. Ten compounds have been identified via a NIST library search and they are alanine, lysine and phenylalanine (amino acids), citrate and 2-oxoglutarate in TCA cycle (energy metabolism), ornithine and urea in urea cycle, linoleate (unsaturated fatty acid), creatinine and cholesterol. Among them, alanine, urea and phenylalanine were also identified in the tissue samples in a previous study.18 These analysis results provide valuable pointers for further biological investigations about liver injury-induced metabolic disorder. We have also applied PCA and PLS-DA analyses to the quantitation data from HDA. The healthy control and diseased groups are well separated (Supplementary Figure S4 and Supplementary Table S2). This indicates that both ADAP and HDA are able to successfully extract quantitative information from the raw data.

Conclusions and Discussions A fully Automated Data Analysis Pipeline (ADAP) for GC-TOF-MS metabonomics studies has been presented. ADAP enables high-throughput data analyses by using parallel

Figure 5. Multivariate statistical analysis of the quantitative metabolites data extracted from the LI samples by ADAP. Black and red markers correspond to liver injury (n ) 10) and healthy controls (n ) 10) samples, respectively. (A) PCA score plot. The first four principal components account for 45.6% of the total variance. (B) PLS-DA score plot. R2Y ) 0.996 and Q2Y ) 0.641 using two principal components in total. 5980

Journal of Proteome Research • Vol. 9, No. 11, 2010

research articles

GC-MS Metabonomics Data Analysis Pipeline computing and features peak picking, deconvolution, alignment, and identification without any manual intervention. In particular, two novel algorithms have been designed for efficient deconvolution and accurate alignment of components. Currently, we are making further progress to deconvolute EICs of shared ions more accurately. Preliminary testing has demonstrated considerable improvements. Additionally, we plan to apply signal processing techniques to denoise EICs. Our goal is to filter out true noise interferences while preserving valuable information on low-concentration compounds. ADAP is a module within a comprehensive data analysis framework that is equipped with the capability to perform comparative metabonomics studies and associated visualizations of large-scale data sets.

Acknowledgment. We thank Scott Woods and Steven Blanchard for their great help by making the high performance computing resources available to this project. These facilities are located on the campus of University of North Carolina at Charlotte and the North Carolina Research Campus. Supporting Information Available: Figure S1, mass spectrum and EIC peak for L-histidine illustrating the importance of performing peak picking prior to denoising. Figure S2, example spectra demonstrating the improvement of alignment by using the two-phase over the one-phase approach. Figure S3, performance of ADAP in terms of extracting quantitative information. Figure S4, results from applying PCA and PLSDA to the quantitative data obtained by HDA after processing the liver injury samples. Table S1, list of 38 mixed standards compounds and the corresponding matching scores from searching against an in-house library using AMDIS. Table S2, quantitative data from ADAP after analysis of liver injury samples. Spectra S1, Spectra S2, Spectra S3, mass spectra files produced by ADAP in the standard NIST text format for MS, CC, and LI samples, respectively. This material is available free of charge via the Internet at http://pubs.acs.org. References (1) Nicholson, J. K.; Lindon, J. C.; Holmes, E. ‘Metabonomics’: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica 1999, 29 (11), 1181–189. (2) Fiehn, O. Metabolomicssthe link between genotypes and phenotypes. Plant Mol. Biol. 2002, 48 (1-2), 155–171. (3) Lindon, J. C.; Holmes, E.; Bollard, M. E.; Stanley, E. G.; Nicholson, J. K. Metabonomics technologies and their applications in physiological monitoring, drug safety assessment and disease diagnosis. Biomarkers 2004, 9 (1), 1–31. (4) Sumner, L. W.; Mendes, P.; Dixon, R. A. Plant metabolomics: largescale phytochemistry in the functional genomics era. Phytochemistry 2003, 62 (6), 817–836. (5) Gibney, M. J.; Walsh, M.; Brennan, L.; Roche, H. M.; German, B.; van Ommen, B. Metabolomics in human nutrition: opportunities and challenges. Am. J. Clin. Nutr. 2005, 82 (3), 497–503. (6) Robertson, D. G. Metabonomics in toxicology: a review. Toxicol. Sci. 2005, 85 (2), 809–822. (7) Lenz, E. M.; Wilson, I. D. Analytical strategies in metabonomics. J. Proteome Res. 2007, 6 (2), 443–458. (8) Slotta, D. J.; Barrett, T.; Edgar, R. NCBI Peptidome: a new public repository for mass spectrometry peptide identifications. Nat. Biotechnol. 2009, 27 (7), 600–601. (9) Lommen, A. MetAlign: interface-driven, versatile metabolomics tool for hyphenated full-scan mass spectrometry data preprocessing. Anal. Chem. 2009, 81 (8), 3079–3086.

(10) Baran, R.; Kochi, H.; Saito, N.; Suematsu, M.; Soga, T.; Nishioka, T.; Robert, M.; Tomita, M. MathDAMP: a package for differential analysis of metabolite profiles. BMC Bioinf. 2006, 7, 530. (11) Bunk, B.; Kucklick, M.; Jonas, R.; Munch, R.; Schobert, M.; Jahn, D.; Hiller, K. MetaQuant: a tool for the automatic quantification of GC/MS-based metabolome data. Bioinformatics 2006, 22 (23), 2962–2965. (12) Broeckling, C. D.; Reddy, I. R.; Duran, A. L.; Zhao, X.; Sumner, L. W. MET-IDEA: data extraction tool for mass spectrometry-based metabolomics. Anal. Chem. 2006, 78 (13), 4334–4341. (13) Jonsson, P.; Gullberg, J.; Nordstrom, A.; Kusano, M.; Kowalczyk, M.; Sjostrom, M.; Moritz, T. A strategy for identifying differences in large series of metabolomic samples analyzed by GC/MS. Anal. Chem. 2004, 76 (6), 1738–1745. (14) Jonsson, P.; Johansson, A. I.; Gullberg, J.; Trygg, J.; A, J.; Grung, B.; Marklund, S.; Sjostrom, M.; Antti, H.; Moritz, T. High-throughput data analysis for detecting and identifying differences between samples in GC/MS-based metabolomic analyses. Anal. Chem. 2005, 77 (17), 5635–5642. (15) Jonsson, P.; Johansson, E. S.; Wuolikainen, A.; Lindberg, J.; Schuppe-Koistinen, I.; Kusano, M.; Sjostrom, M.; Trygg, J.; Moritz, T.; Antti, H. Predictive metabolite profiling applying hierarchical multivariate curve resolution to GC-MS datasa potential tool for multi-parametric diagnosis. J. Proteome Res. 2006, 5 (6), 1407–1414. (16) Stein, S. E. An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. J. Am. Soc. Mass Spectrom. 1999, 10 (8), 770–781. (17) Luedemann, A.; Strassburg, K.; Erban, A.; Kopka, J. TagFinder for the quantitative analysis of gas chromatographysmass spectrometry (GC-MS)-based metabolite profiling experiments. Bioinformatics 2008, 24 (5), 732–737. (18) Pan, L.; Qiu, Y.; Chen, T.; Lin, J.; Chi, Y.; Su, M.; Zhao, A.; Jia, W. An optimized procedure for metabonomic analysis of rat liver tissue using gas chromatography/time-of-flight mass spectrometry. J. Pharm. Biomed. Anal. 2010, 52 (4), 589–596. (19) Li, H.; Xie, Z.; Lin, J.; Song, H.; Wang, Q.; Wang, K.; Su, M.; Qiu, Y.; Zhao, T.; Song, K.; Wang, X.; Zhou, M.; Liu, P.; Zhao, G.; Zhang, Q.; Jia, W. Transcriptomic and metabonomic profiling of obesityprone and obesity-resistant rats under high fat diet. J. Proteome Res. 2008, 7 (11), 4775–4783. (20) Qiu, Y.; Cai, G.; Su, M.; Chen, T.; Zheng, X.; Xu, Y.; Ni, Y.; Zhao, A.; Xu, L. X.; Cai, S.; Jia, W. Serum metabolite profiling of human colorectal cancer using GC-TOFMS and UPLC-QTOFMS. J. Proteome Res. 2009, 8 (10), 4844–4850. (21) Yang, C.; He, Z.; Yu, W. Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. BMC Bioinf. 2009, 10, 4+. (22) Colby, B. N. Spectral deconvolution for overlapping GC/MS components. J. Am. Soc. Mass Spectrom. 1992, 3, 558–562. (23) Stein, S. E. An integrated method for spectrum extraction and compound identification from GC/MS data. J. Am. Soc. Mass Spectrom. 1999, 10, 770–781. (24) Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Comput. Appl. Math. 1987, 20, 53–65. (25) van Nederkassel, A. M.; Daszykowski, M.; Eilers, P. H. C.; Heyden, Y. V. A comparison of three algorithms for chromatograms alignment. J. Chromatogr., A 2006, 1118 (2), 199–210. (26) Zimmer, J. S.; Monroe, M. E.; Qian, W. J.; Smith, R. D. Advances in proteomics data analysis and display using an accurate mass and time tag approach. Mass Spectrom. Rev. 2006, 25 (3), 450–82. (27) Jaffe, J. D.; Mani, D. R.; Leptos, K. C.; Church, G. M.; Gillette, M. A.; Carr, S. A. PEPPeR, a platform for experimental proteomic pattern recognition. Mol. Cell. Proteomics 2006, 5 (10), 1927–1941. (28) NIST MS Search Program (software version 2.0). http://www.nist. gov/srd/nist1a.htm. (29) Trygg, J.; Holmes, E.; Lundstedt, T. Chemometrics in metabonomics. J. Proteome Res. 2007, 6 (2), 469–479. (30) van den Berg, R. A.; Hoefsloot, H. C.; Westerhuis, J. A.; Smilde, A. K.; van der Werf, M. J. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 2006, 7, 142. (31) Yan, N.; Mingming, S.; Jinchao, L.; Xiaoyan, W.; Yunping, Q.; Aihua, Z.; Tianlu, C.; Wei, J. Metabolic profiling reveals disorder of amino acid metabolism in four brain regions from a rat model of chronic unpredictable mild stress. FEBS Lett. 2008, 582 (17), 2627–2636.

PR1007703

Journal of Proteome Research • Vol. 9, No. 11, 2010 5981