TOFSIMS-P: A Web-Based Platform for Analysis of ... - ACS Publications

ARTICLE pubs.acs.org/ac

TOFSIMS-P: A Web-Based Platform for Analysis of Large-Scale TOF-SIMS Data

So Jeong Yun,† Ji-Won Park,‡,§ Il Ju Choi,|| Byeongsoo Kang,† Hark Kyun Kim,|| Dae Won Moon,‡,§ Tae Geol Lee,*,‡,§ and Daehee Hwang†,^,z,*

† School of Interdisciplinary Bioscience and Bioengineering, POSTECH, Pohang, Republic of Korea
‡ Center for Nano-Bio Convergence, Korea Research Institute of Standards and Science, Daejeon, Republic of Korea
§ Department of Nano and Bio Surface Science, University of Science and Technology, Daejeon, Republic of Korea
|| National Cancer Center, Goyang, Republic of Korea
^ Department of Chemical Engineering, POSTECH, Pohang, Republic of Korea
z Division of Integrative Biosciences and Biotechnology, POSTECH, Pohang, Republic of Korea



Supporting Information

ABSTRACT:

Time-of-flight secondary ion mass spectrometry (TOF-SIMS) is a useful tool for profiling secondary ions from the near-surface region of specimens, owing to its high molecular specificity and submicrometer spatial resolution. However, TOF-SIMS analysis of even a moderately large number of samples has been hampered by the lack of tools for automatically analyzing the large volumes of data produced. Here, we present a computational platform that automatically identifies and aligns peaks, finds discriminatory ions, builds a classifier, and constructs networks describing differential metabolic pathways. To demonstrate the utility of the platform, we analyzed 43 data sets generated from seven gastric cancer and eight normal tissues using TOF-SIMS. A total of 87 138 ions were detected from the 43 data sets. We selected and then aligned 1286 ions. Among them, we found 66 ions that discriminate gastric cancer tissues from normal ones. Using these 66 ions, we then built a partial least squares-discriminant analysis (PLS-DA) model with a misclassification error rate of 0.024. Finally, network analysis of the 66 ions showed dysregulation of amino acid metabolism in the gastric cancer tissues. The results show that the proposed framework was effective in analyzing TOF-SIMS data from a moderately large number of samples, resulting in discrimination of gastric cancer tissues from normal tissues and identification of biomarker candidates associated with amino acid metabolism.

Time-of-flight secondary ion mass spectrometry (TOF-SIMS) is useful for analyzing the chemical compositions and distributions of positive and negative secondary ions from the very near surface region of specimens “as received” with high molecular specificity, high surface sensitivity, and submicrometer spatial resolution.1,2 Recently, TOF-SIMS has been successfully applied to complex biological samples, such as cells3–6 and tissues.7–12 The resulting TOF-SIMS spectra represent the surface chemical characteristics of the biological samples being investigated. A TOF-SIMS spectrum generated from tissues commonly contains hundreds or even thousands of ion peaks. Because of this enormous quantity of information, it is often challenging to identify the peaks and compare every peak across the TOF-SIMS spectra of even a moderately large number of samples. Moreover, there is a lack of suitable tools to automatically analyze the vast amount of high-dimensional data generated from a large number of samples using TOF-SIMS.

Received: July 16, 2011
Accepted: November 4, 2011
Published: November 4, 2011

© 2011 American Chemical Society

dx.doi.org/10.1021/ac2016932 | Anal. Chem. 2011, 83, 9298–9305

Analytical Chemistry

Several tools have been introduced for the analysis of TOF-SIMS data. First, IonSpec and TOFPak have been used to identify the ion peaks in a spectrum. Second, for comparative analysis of ion abundances between multiple samples, the peaks from different samples must be matched to one another, a step called peak alignment, which is normally done manually and can be complicated for a large number of samples. Next, for interpretation of the TOF-SIMS data, including identification of the ions that can discriminate among different groups of samples (e.g., diseased and normal samples) and visualization of the samples, several multivariate statistical analyses (MVAs13–19), such as principal component analysis (PCA) and partial least squares (PLS), have been applied to the aligned peaks. They were shown to be appropriate for analysis of the complex data generated from various biological specimens, including proteins,20–24 cells,25 and tissues.26–29 Finally, the discriminatory ions selected by the MVAs are commonly identified using commercial or public databases by matching their masses with those of known metabolites. However, this database search can be complicated by multiple matches for a single ion, mostly because of the moderately large mass tolerance used for TOF-SIMS (e.g., 0.1 Da). Of note, many of these analyses have typically been done manually because no automatic tools were available.

Here, we present a computational framework, named TOFSIMS-P, for effective analysis of large-scale TOF-SIMS data that can automatically perform peak identification, peak alignment, identification of discriminatory ions, and construction of a classifier and of metabolic networks describing differential pathways. To show its validity, we applied the framework to 43 TOF-SIMS data sets generated from seven gastric cancer and eight normal tissue samples. Among a total of 87 138 ions detected, we selected 1286 ions and then aligned them using a template-based method. We then selected 66 ions discriminating gastric cancer tissues from normal controls based on both analysis of variance (ANOVA) and PLS-DA. Using the 66 ions, we then built a PLS-DA model that gave a low misclassification error rate in cross-validation. Finally, we reconstructed network models using the 66 ions. The networks revealed dysregulation of the metabolism of amino acids, such as arginine, proline, and phenylalanine, in gastric cancer. The results indicate that the proposed framework can support effective analysis of TOF-SIMS data collected from a moderately large number of samples. TOFSIMS-P is available at http://sbm.postech.ac.kr/TOFSIMSP/.

MATERIALS AND METHODS

Sample Preparation. Tissues were obtained at the National Cancer Center in Korea from 2005 to 2008, with informed consent and institutional review board approval. Samples were flash frozen in liquid nitrogen and stored at −80 °C until analysis.30 No chemical fixation was done because chemical fixatives could react with the molecules to be detected by TOF-SIMS. Two serial sections of each tissue, 10 μm thick, were cut at −20 °C using a cryostat (Leica CM 3050S, Leica Microsystems Inc., Bannockburn, IL). One section was affixed to a glass slide and stained with hematoxylin and eosin (H&E). An optical image was acquired via a microscope attached to a digital camera. The other section was deposited onto a Si wafer that had been sonicated in ethanol and acetone for 5 min each and rinsed with water, and was then directly analyzed by TOF-SIMS.


TOF-SIMS Analysis. We performed ion profiling of normal and tumor tissues using a TOF-SIMS V instrument (ION-TOF GmbH, Germany) equipped with a bismuth liquid metal ion gun (LMIG). Bi3+ primary ions at 25 keV in the high-current bunched mode were used to obtain positive and negative spectra. The analysis area of 100 × 100 μm2 was randomly rastered by the primary ions with a spatial resolution of 1 μm, and the tissue samples were charge-compensated by low-energy electron flooding. The primary ion dose density was maintained below 10¹² ions·cm⁻² to ensure static SIMS conditions. Mass resolution was higher than 7000 at m/z < 500 in both the positive and negative modes. Positive and negative ion spectra were obtained from specific areas of each sample. The individual spectra were exported to ASCII files. Positive and negative ion spectra were internally calibrated using the CH3+, C2H3+, C3H5+, and C5H14NO+ peaks and the CH−, C2H−, C4H−, and C18H35O2− peaks, respectively. After the calibration, the resulting mass accuracy was 10 ppm for m/z < 200 and 15 ppm for m/z > 200 on average.

Identification of the Ion Peaks. The “TOFBAT” program in the built-in IonSpec software from ION-TOF was used to automatically identify the peaks from a batch of TOF-SIMS spectra. Using TOFBAT, we set up a batch job in the batch-job editor using the following macro functions in “Macro Toolbox”: (1) “For $A from 1 to n do” in “System Macro” to analyze multiple spectra using the “For-loop”; (2) “LoadSpec()” in “IonSpec Macro” to load each spectrum and identify the peaks within the loop; and (3) “SaveSpecASCII()” to save the peaks from each spectrum into an ASCII file within the loop. The “Autosearch” algorithm was used for peak detection. In the “SaveSpecASCII” macro function, we set the mass range to 1–800 and the intensity threshold to 20, so that only peaks with intensities larger than 20 within m/z = 1 to 800 were retained.
For each spectrum, the resulting ASCII file includes the m/z, m/z range, and area of each peak.

Generation of a Template for Peak Alignment. Using the list of peaks from all the spectra, sorted by intensity, we generate a template of nonredundant peaks as follows. First, we add the peak with the largest intensity (rank 1) in the list to the template. Second, for the next peak (rank 2) in the list, we evaluate whether it is redundant with the peak in the template. For this evaluation, we compute the widths of the peak (rank 2) and of the one in the template (rank 1) at 3/4 of their respective intensities, and then the overlap of these widths. If the overlap is nonzero, we conclude that the peak is redundant with the one in the template, skip this peak (rank 2), move to the next peak (rank 3) in the list, and repeat the redundancy evaluation. If there is no overlap, the peak (rank 2) is considered nonredundant with the one (rank 1) in the template and is added to the template. We repeat this procedure for all the peaks in the list. When there are k peaks (k > 1) in the template, the peak being evaluated is compared with all k reference peaks in the template.

Template-Based Alignment Method. Each peak from a spectrum is aligned with its most overlapping reference peak in the template. To identify the most overlapping peak, we first calculate (1) the mean difference (di) between the peak being aligned and reference peak i and (2) the overlap (vi) between the widths of the peak and reference peak i at 3/4 of their intensities. We then compute their ratio, di/vi. Each peak is finally aligned to the reference peak with the smallest ratio among all the reference peaks. This procedure is repeated for the peaks identified from each spectrum.
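The template generation and alignment steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the TOFSIMS-P implementation: the `Peak` class and the function names are hypothetical, each peak is assumed to already carry the m/z interval spanned by its width at 3/4 of its intensity, and the "mean difference" di is interpreted here as the absolute difference between the two peak m/z values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float         # peak position (m/z)
    lo: float         # lower m/z bound of the width at 3/4 of the intensity (assumed given)
    hi: float         # upper m/z bound of that width
    intensity: float


def overlap(a, b):
    """Length of the m/z interval shared by two peaks (0.0 if they are disjoint)."""
    return max(0.0, min(a.hi, b.hi) - max(a.lo, b.lo))


def build_template(peaks):
    """Visit peaks by descending intensity; keep a peak only if it overlaps no template peak."""
    template = []
    for p in sorted(peaks, key=lambda q: q.intensity, reverse=True):
        if all(overlap(p, ref) == 0.0 for ref in template):
            template.append(p)
    return template


def align(peak, template):
    """Index of the overlapping reference peak with the smallest d_i / v_i, or None."""
    best_idx, best_ratio = None, float("inf")
    for i, ref in enumerate(template):
        v = overlap(peak, ref)
        if v == 0.0:
            continue  # no overlap: this reference peak is not a candidate
        ratio = abs(peak.mz - ref.mz) / v  # d_i / v_i (d_i interpretation is an assumption)
        if ratio < best_ratio:
            best_idx, best_ratio = i, ratio
    return best_idx


# Toy data: two spectra with made-up peaks.
spectrum1 = [Peak(100.00, 99.98, 100.02, 500.0), Peak(150.00, 149.97, 150.03, 120.0)]
spectrum2 = [Peak(100.01, 99.99, 100.03, 450.0), Peak(200.00, 199.98, 200.02, 80.0)]

template = build_template(spectrum1 + spectrum2)
print([p.mz for p in template])
print([align(p, template) for p in spectrum2])
```

In this toy input the peak at m/z 100.01 in the second spectrum overlaps the stronger peak at m/z 100.00, so it is left out of the template and is later aligned to that reference peak.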


Figure 1. Peak identification and refinement. (A) Four example peaks identified by IonSpec. X- and Y-axes represent m/z and intensity, respectively. The areas of the individual peaks are denoted in blue. (B) Gaussian fitting results. The fitted curves are represented by red lines while the raw intensity profiles are represented by either the blue or black line. (C) Goodness-of-fit measures showing SSE, SST, and R2.

RESULTS AND DISCUSSION

TOF-SIMS Analysis. We generated 43 TOF-SIMS data sets from seven gastric cancer and eight normal tissues obtained from patients and volunteers undergoing endoscopic biopsy (Figure S-1A in Supporting Information). The characteristics of the patients are summarized in Table S-1 in Supporting Information. For each tissue sample, we cut serial sections of 10 μm thickness. One section was used for H&E staining to identify multiple areas that were enriched with normal epithelial or cancer cells (Figure S-1B and C in Supporting Information). The corresponding areas in the other section (100 × 100 μm2) were then analyzed using TOF-SIMS in both the positive and negative modes. Two or three different areas in each tissue sample were analyzed to account for cellular heterogeneity reflecting the varying states of normal epithelial or tumor cells within the sample. A total of 43 sets of positive and negative ion spectra were generated from the seven cancer and eight normal samples.

Peak Identification and Refinement. For the 43 sets of positive and negative spectra, we identified a total of 87 138 ion peaks using the TOFBAT program in IonSpec software (version 4.1; Figure S-1A in Supporting Information) as described in Materials and Methods. Each peak was defined by its m/z, m/z range, and peak area, which represents the abundance of the corresponding ion. Close examination of these peaks, however, revealed that some of them, especially those for low-abundance ions, are often irregular in shape because of corruption by chemical noise, resulting in unreliable m/z values and areas. To remove these abnormally shaped peaks, we applied Gaussian fitting to each peak identified by IonSpec. Figure 1A shows four example peaks identified by IonSpec. After Gaussian fitting of each peak, we evaluated the goodness-of-fit (R2) (Figures 1B and 1C; Method 1 in Supporting Information).
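The goodness-of-fit check can be illustrated with a short Python sketch, using R2 = 1 − SSE/SST. It is an illustration only: the actual fitting procedure is given in Method 1 of the Supporting Information, which is not reproduced here, so the sketch substitutes a simple moment-based Gaussian estimate (intensity-weighted mean and variance, with a least-squares amplitude) for a full nonlinear fit. The function names and the synthetic profiles are hypothetical.

```python
import math


def gaussian_r2(mz, intensity):
    """Return (R2, SSE, SST) for a single-Gaussian description of one peak profile."""
    total = sum(intensity)
    # Moment-based estimates of the Gaussian centre and variance (stand-in for a full fit).
    mu = sum(x * y for x, y in zip(mz, intensity)) / total
    var = sum(y * (x - mu) ** 2 for x, y in zip(mz, intensity)) / total
    shape = [math.exp(-((x - mu) ** 2) / (2.0 * var)) for x in mz]
    # Closed-form least-squares amplitude for the fixed centre and width.
    amp = sum(y * g for y, g in zip(intensity, shape)) / sum(g * g for g in shape)
    sse = sum((y - amp * g) ** 2 for y, g in zip(intensity, shape))
    mean_y = total / len(intensity)
    sst = sum((y - mean_y) ** 2 for y in intensity)
    return 1.0 - sse / sst, sse, sst


def synthetic_profile(mz, centres, sigma=0.01, height=1000.0):
    """Sum of equal-height Gaussians; used only to generate toy data."""
    return [
        sum(height * math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centres)
        for x in mz
    ]


mz = [99.94 + 0.001 * i for i in range(121)]
single = synthetic_profile(mz, [100.00])         # one clean ion feature
merged = synthetic_profile(mz, [99.97, 100.03])  # two features reported as one peak

for name, y in (("single", single), ("merged", merged)):
    r2, sse, sst = gaussian_r2(mz, y)
    print(f"{name}: R2 = {r2:.3f}")
```

On these toy profiles, the clean single-ion peak gives an R2 close to 1, whereas the profile in which two ion features are merged into one peak is poorly described by a single Gaussian and yields a much lower R2.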
We then selected the peaks with high goodness-of-fit (R2 ≥ 0.8):31–34 Cases 1 and 4, where IonSpec correctly identified the peaks. In contrast, we removed the abnormal peaks with R2 < 0.8 from the list of peaks: Cases 2 and 3, where IonSpec incorrectly identified two ion features (the arrows in Figure 1B) as single peaks. After the removal of the abnormal peaks, on average, 33% of the peaks identified by IonSpec for each spectrum (28 708 out of 87 138 ion peaks) were selected for further analysis.

For analysis of TOF-SIMS data, the mass binning method has been used to reduce the number of variables (m/z) and to keep the sparse nature of the TOF-SIMS data from inhibiting meaningful analyses.35 The mass binning method reduces the data to sampled m/z values and their intensities, whereas TOFSIMS-P reduces the data to peak m/z values and peak intensities. This difference between the two methods (i.e., sampled bin m/zs vs peak m/zs) affects several aspects of TOF-SIMS data analysis, such as quantification of ions, MVA-based classification, and identification of the discriminatory ions (see Method 2 in Supporting Information).

Peak Alignment. For comparative analysis of the abundances of peaks identified from multiple groups of samples, the same peaks in different samples must be matched, a step called peak alignment (Figure S-1A in Supporting Information). We developed a template-based alignment method that involves the four steps shown in Figure 2. First, the peaks from all the samples were merged and sorted by intensity in descending order (Figure 2A). Second, we generated a template to be used as a reference for aligning the peaks from all the samples. Moving down the list in Figure 2A from the peak with the largest intensity, only nonredundant peaks were added to the template while redundant ones were removed (Figure 2B; see Materials and Methods for details). For the peaks from the 43 spectra, the resulting template consisted of 705 positive and 581 negative nonredundant ion peaks. Third, the peaks from each spectrum were aligned to their most overlapping reference peaks in the template (Figure 2C; see Materials and Methods for details).


Figure 2. Peak alignment. (A) Peaks sorted by their intensity in the descending manner. (B) Generation of a template including nonredundant peaks. (C) Alignment of the peaks in the individual spectra to the reference peaks in the template. (D) Assessment of the alignment results using the similarity measures (see text).

After the alignment of both positive and negative ion peaks, we normalized their intensities using the quantile normalization method.36 Among the peaks in the template, we then selected the ones that were detected in more than 15 cancer samples and 18 normal samples (75%) to ensure statistical reliability in the following analyses: we selected 233 positive (33%) and 225 negative (38%) ion peaks. Fourth, we assessed the performance of the alignment by evaluating the similarity of all possible pairs of the 43 data sets (Figure 2D). For each pair of data sets, we evaluated the number of peaks commonly detected in the two samples (Soverlap), the correlation of peak intensities (Sint), and their product as a combined similarity measure (Ssim), as previously described.37 The high similarity scores (Ssim > 0.9) in Figure 2D indicate that the peaks from different samples were successfully aligned, considering that the abundances of a majority of the peaks are not likely to change across the samples.

Figure 3. Selection of discriminatory ions and their associated metabolic pathways. (A) An integrative method for selection of discriminatory peaks based on ANOVA and PLS-DA. (B) Sample separation achieved by PLS-DA. The variances of 76.59% and 6.57% in X-block were explained by LV1 and LV2, respectively. (C) Results of the incremental cross-validations. The green dots and the red lines on the box plot represent the mean and the median of the misclassification error rates, respectively. (D) List of metabolites that were identified by the HMDB search and have pathway data according to the KEGG compound database. (E) Box plots of two example peaks up- and down-regulated in the gastric cancer. (F) Metabolic pathways in which the discriminatory peaks are involved.

Selection of Discriminatory Ions. We then identified the ion peaks, called discriminatory ions (Figure S-1A in Supporting Information), whose abundances differed between gastric cancer and normal tissues. For all aligned peaks (233 positive and 225 negative ion peaks), we first performed the ANOVA test38 and selected 213 ion peaks with p-values less than 0.01 (Figure 3A). The ANOVA test evaluates how significantly each individual peak, considered independently, separates cancer tissues from normal ones. However, significance should also be evaluated based on the collective contribution of the peaks to the separation. To evaluate the collective contribution, we applied a multivariate classification analysis, partial least squares-discriminant analysis (PLS-DA), to the intensities of the 213 peaks selected by the ANOVA test. For an unbiased PLS-DA, we performed 10-fold cross-validations (CVs) 1000 times. The CV experiments were repeated with an increasing number of PLS latent variables (LVs), which revealed that two LVs resulted in the smallest misclassification error rate (Figure S-2 in Supporting Information). Using the two PLS LVs, we then performed the CVs 1000 times again and computed the mean “variable importance in projection (VIP)” of each peak. A large VIP value indicates that the corresponding peak has a high collective contribution to the separation between cancer and normal tissues. A final set of 66 discriminatory peaks was then selected as those with mean VIP > 1 (Figure 3A; see Method 3 in Supporting Information for details).
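For orientation, the VIP score is commonly defined as shown below. This is the standard textbook form and is given here as an assumption: the exact expression used in this work is described in Method 3 of the Supporting Information, which is not reproduced in this section. The symbols p, A, w, q, and t are introduced only for this illustration.

```latex
\mathrm{VIP}_j
  = \sqrt{\frac{p \sum_{a=1}^{A} \mathrm{SS}_a \,
      \bigl( w_{ja} / \lVert \mathbf{w}_a \rVert \bigr)^{2}}
     {\sum_{a=1}^{A} \mathrm{SS}_a}},
\qquad
\mathrm{SS}_a = q_a^{2}\, \mathbf{t}_a^{\top} \mathbf{t}_a
```

Here p is the number of peaks entering the model (213 in this analysis), A is the number of latent variables (two), w_{ja} is the weight of peak j on latent variable a, t_a is the score vector of latent variable a, and q_a is the corresponding loading for the class label. With this definition the squared VIP values average to 1 over all peaks, which is why VIP > 1 is a natural cutoff for peaks with an above-average contribution.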

Figure 4. Network describing amino acid and nucleotide metabolic pathways deregulated in cancer. Node colors represent up- (red) and down-regulation (green) in gastric cancer.

Interestingly, 63 out of the 66 peaks were increased in abundance in cancer samples compared to normal ones (Table S-2 in Supporting Information). Figure 3B shows that the PLS-DA model constructed with the 66 discriminatory peaks can successfully discriminate gastric cancer samples from normal ones using the first two LV scores, resulting in a mean misclassification error rate of 0.02. We further evaluated the robustness of the PLS-DA performance with the 66 ion peaks against variation in the size of the training set by performing incremental-fold CVs. Figure 3C shows that the mean misclassification error rate remains small (