SMART: Statistical Metabolomics Analysis—An R ... - ACS Publications

Jun 1, 2016 - ABSTRACT: Metabolomics data provide unprecedented opportunities to decipher metabolic mechanisms by analyzing hundreds to thousands ...
0 downloads 0 Views 713KB Size
Article pubs.acs.org/ac

SMART: Statistical Metabolomics AnalysisAn R Tool Yu-Jen Liang,†,‡ Yu-Ting Lin,‡ Chia-Wei Chen,‡ Chien-Wei Lin,‡ Kun-Mao Chao,†,§,+ Wen-Harn Pan,∥,+ and Hsin-Chou Yang*,‡,⊥,#,¶,△,+ †

Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 10617, Taiwan Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan § Department of Computer Science and Information Engineering, National Taiwan University, Taipei 10617, Taiwan ∥ Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan ⊥ Institute of Public Health, National Yang Ming University, Taipei 11221, Taiwan # Department of Statistics, National Cheng Kung University, Tainan 701, Taiwan ¶ Institute of Statistics, National Tsing Hua University, Hsinchu 30013, Taiwan △ School of Public Health, National Defense Medical Center, Taipei 114, Taiwan ‡

S Supporting Information *

ABSTRACT: Metabolomics data provide unprecedented opportunities to decipher metabolic mechanisms by analyzing hundreds to thousands of metabolites. Data quality concerns and complex batch effects in metabolomics must be appropriately addressed through statistical analysis. This study developed an integrated analysis tool for metabolomics studies to streamline the complete analysis flow from initial data preprocessing to downstream association analysis. We developed Statistical Metabolomics AnalysisAn R Tool (SMART), which can analyze input files with different formats, visually represent various types of data features, implement peak alignment and annotation, conduct quality control for samples and peaks, explore batch effects, and perform association analysis. A pharmacometabolomics study of antihypertensive medication was conducted and data were analyzed using SMART. Neuromedin N was identified as a metabolite significantly associated with angiotensin-converting-enzyme inhibitors in our metabolome-wide association analysis (p = 1.56 × 10−4 in an analysis of covariance (ANCOVA) with an adjustment for unknown latent groups and p = 1.02 × 10−4 in an ANCOVA with an adjustment for hidden substructures). This endogenous neuropeptide is highly related to neurotensin and neuromedin U, which are involved in blood pressure regulation and smooth muscle contraction. The SMART software, a user guide, and example data can be downloaded from http://www.stat.sinica.edu.tw/hsinchou/metabolomics/SMART.htm.

L

XCMS Online,11,18,19 are available for performing statistical and pathway analyses of metabolomics data. Recently, numerous large-scale metabolome-wide association studies (MWASs) have been conducted.20−24 Such studies are characterized by a long study time, complex experimental factors, and a high number of samples with multiple technical and biological replicates;25,26 technical replicates are designed to detect system variability and biological replicates are used to increase the statistical power.26,27 The experiments of largescale MWASs may be influenced by several sources of variation, including controllable and uncontrollable factors. The controllable factors, such as sample characteristics and known batch effects (e.g., a batch effect due to a change of an LC column28 or a working solution of O-methoxylamine hydrochloride29), can be considered during study design. Uncontrollable factors, such as specimen contamination and unknown batch effects

iquid chromatography−mass spectrometry (LC−MS) and gas chromatography−mass spectrometry (GC−MS) have high sensitivity and selectivity in rapidly separating and quantifying target or nontarget compounds, including metabolites and lipids, in biological fluids.1 Recently, metabolomics profiling has been used widely for exploring the associations of metabolites with disease states,2 studying drug metabolism,3 performing clinical diagnostics,4 understanding toxicity mechanisms,5 and investigating human nutrition.6 The results have facilitated understanding the roles of metabolites in biological mechanisms.7−9 As the frequency of metabolomics studies has increased, several analysis tools have become available. Commercial software for metabolomics data processing, peak alignment, and data visualization is provided with experimental instruments such as Mass Profiler Pro (Agilent Technologies, Palo Alto, CA, USA), ProfileAnalysis (Bruker, Billerica, Massachusetts, USA), and MarkerLynx (Waters, Milfold, MA, USA). Moreover, free packages, such as MZmine,10 XCMS,11 and MultiAlign,12 have been developed. In addition, some webbased tools, such as MetaboAnalyst,13−15 MeltDB,16,17 and © XXXX American Chemical Society

Received: February 14, 2016 Accepted: May 20, 2016

A

DOI: 10.1021/acs.analchem.6b00603 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry (e.g., a batch effect due to period-to-period variation in LC column performance30), should be appropriately addressed during statistical analysis. Poor specimen quality due to contamination interferes with peak alignment and increases data noise. Unknown batch effects, commonly observed in large-scale MWASs,31 render comparing studied groups in different batches difficult.32,33 Without appropriate data quality control and statistical modeling, false positives and negatives in statistical associations increase substantially. These concerns cannot be completely resolved by using any of the available software. To facilitate addressing the challenges of statistical analysis in large-scale MWASs, this study developed an integrated analysis tool, Statistical Metabolomics AnalysisAn R Tool (SMART), for MWASs to streamline the complete analysis flow from initial data preprocessing to downstream association analysis. Analysis functions in SMART are summarized and compared to existing tools (Table S1).

The second format is the format of raw spectrum data, such as .d files from Agilent Technologies (Santa Clara, CA, USA) and .raw files from Waters (Milford, MA, USA). SMART converts these files to mzXML files by incorporating the msconvert function in ProteoWizard 3.0.35,36 The third format is the comma-delimited peak abundance data of mass spectra. The peak abundance data can be obtained from widely used programs such as Markerlynx (an add-in to Masslynx, Waters (Milford, MA, USA)), MZmine,10 MetAlign,37 and XCMS.11 The peak abundance data file contains the mass-to-charge ratio (m/z), chromatographic retention time (RT, in seconds or minutes), and peak abundance or metabolite concentration of all replicate samples of a study subject. Data Visualization. SMART supports visualization of multiple data types (Figure S1A−E). First, SMART can display raw mass spectra in any user-specified m/z and RT region for the replicate samples of interest in a two-dimensional plot (Figure S1A) or three-dimensional plot (Figure S1B). The spectrum figures facilitate monitoring the peak shapes and the difference in peak profiles between replicate samples. Second, SMART extracts the total ion current (i.e., the sum of ion intensities of all masses during a specific RT scan) by using the mzR package of Bioconductor.35 SMART can then display the total ion chromatogram (TIC), that is, the profiling of all total ion currents, for each replicate sample (Figure S1C). Third, SMART can display cross-replicate-sample and cross-subject TIC overlay plots (Figure S1D). Finally, SMART provides a TIC cluster diagram of all replicate samples of all subjects for an initial evaluation of sample contamination or instrument failure (Figure S1E). Peak Alignment (Peak Detection and RT Alignment) and Annotation. SMART conducts peak detection and RT alignment by incorporating matched filtration11 and centWave38 in XCMS,11 which is one of the most commonly used tools for peak detection and RT alignment. Figure S2 shows the interface of peak alignment. For details on peak alignment, refer to XCMS. In addition, SMART supports parallel computation by applying the R package snow. The applicable number of CPU cores for implementing peak detection and RT alignment (i.e., nSlaves) must be specified in SMART. After peak alignment, the peak abundance can be obtained by calculating the integrated area or the maximum intensity under the original peak before or after peak filtering through the matched Gaussian filter approach.11 SMART detects potential isotopes, adducts, and fragments by incorporating peak annotation in CAMERA.39 The redundant isotopic peaks, unwanted adducts, and daughter ion fragments can be moved from the subsequent analyses by users optionally. Finally, SMART exports the comma-delimited peak abundance data, including the peak index, m/z, RT (seconds), and peak abundance of every replicate sample. Data Preprocessing. Data preprocessing is a major step before performing subsequent statistical data analysis. The peak abundance data of a mass spectrum are transformed and normalized immediately after they are obtained. SMART provides three procedures for data transformation and normalization (Figure S3). The first procedure is abundance adjustment by using an internal standard (e.g., creatinine). The second procedure is to consider the generalized log2 transformation,40 which is a variance-stabilization procedure and can avoid the problem of log transformation of a zero value. The third procedure is sample-based data normalization that



MATERIALS AND METHODS SMART Description. SMART having a user-friendly interface was developed in R and the R graphical user interface in the Windows and Mac operating systems. It provides an integrated analysis tool for GC/LC−MS-based metabolomics data. The utilities of SMART are elaborated in the following subsections according to the SMART workflow (Figure 1). Data Import. SMART supports multiple input file formats. The first format is the mzXML file format, which is an extensible markup language (XML) file format for MS data.34

Figure 1. Flowchart of SMART. The main procedures are written in bold. The parallelograms show different input file formats. The solid line indicates the main analysis procedure of SMART and the dashed line indicates realignment after exclusion of poor peaks and replicate samples. B

DOI: 10.1021/acs.analchem.6b00603 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry

the s-value is expressed as Q3(h) + 1.5 × IQR(h). For the bootstrap method, the s-value is calculated as the sum of the mean and the standard deviation of bootstrap samples of h multiplied by three by using the adjusted bootstrap percentile42 of the R package boot. Finally, a suggestion for replicate sample removal is provided (lower-right plot in Figure S4C). Batch Effect Detection. SMART can detect batch effects due to known experimental conditions, unknown latent groups (LGs), or hidden substructures (Figure S5A). These three sources of batch effects can then be adjusted for in a statistical model (see Association Analysis) by including known batcheffect covariates from prior knowledge, LG variables from a cluster analysis, or principal components (PCs) from a PC analysis (PCA). SMART provides a heat map for determining the number of LGs (Figure S5B). Moreover, it investigates the relationship between the known covariates and the LGs. For example, if the pattern of the experiment date is matching to that of the pattern of LGs, then it reveals that the experiment date is one of the major batch effects. Moreover, SMART can filter out LGs with outlier subjects (Figure S5B); for details, refer to the SMART user guide. In addition, SMART conducts PCA of the peak abundance (Figure S5C). Users can determine the number of PCs by using a scree plot or the proportion of variance explained by PCs > 1%. In the PCA plots, the PCs are colored according to known covariates and used to evaluate the relationship between the known covariates and the PCs (Figure S5C). Association Analysis. To discover the association between metabolites and the main factor groups (e.g., case vs control) or quantitative traits of interest (e.g., blood pressure), SMART provides a general Analysis of Covariance (ANCOVA) model, including factorial, nested, and crossed designs, for MWASs (Figure S6A−B). In this flexible model, the dependent variable is the peak abundance and the independent variables include the factor groups or quantitative traits of interest, covariates, and batch effects. Moreover, SMART enables conducting multiple-group analysis; for details, refer to the SMART user guide. SMART provides two methods for evaluating the statistical significance of the association between the metabolites and the factor groups. When the peak abundance follows a normal distribution, the F test is used; otherwise, a permutation test, which randomly shuffles the values of the factor groups, can be used. Nominal p-values (pv), adjusted p-values after false discovery rate (FDR)43 adjustment (pFDR), empirical p-values (epv), and empirical p-values after FDR adjustment (epFDR) are exported. A volcano plot is used to present the results of an MWAS. To reduce the computational time of permutations, SMART supports parallel computing by applying the snow package. In addition, to avoid multicollinearity among the independent variables in ANCOVA, SMART calculates the variance inflation factor (VIF)44 for evaluating multicollinearity. The default cutoff of the VIF is 10.45 An LG or PC is removed from the ANCOVA model if its VIF is higher than the cutoff. Antihypertensive Pharmacometabolomics Data. We recruited 100 young-onset hypertensive patients from the Academia Sinica Multi-Center Young-Onset Hypertension Study. The patients were divided into two groups: the medication group comprised 50 patients who were treated with angiotensin-converting-enzyme inhibitors (ACEis) and the nonmedication group (NonMed) comprised 50 patients who were not treated with antihypertensive medicine. Refer to Yang et al.46 for information on the inclusion and exclusion criteria

includes scaling to the mean, scaling to the median, standardization, and quantile normalization.41 Quality Control. Peak Filtering. In the peak alignment procedure, some peaks may have a (positive) missing value or zero value in a proportion of replicate samples. SMART calculates the missing and zero rates of each peak. Users must specify cutoffs for the zero and missing rates. If the missing or zero rate of a peak is higher than the cutoff, the peak is removed (Figure S4A). Sample Filtering. Poor-quality replicate samples and subjects are removed through hierarchical cluster analysis based on high-quality peaks as follows (Figure S4A). First, SMART calculates an r-value as the sum of squares within a subject divided by the sum of squares between subjects based on the aligned peak abundance data. The r-value is a quality index for peaks; the higher the r-value is, the poorer the peak quality. Moreover, SMART calculates quantiles of the r-value. Users select a preferred quantile value or assign a nonnegative value as a cutoff of the r-value (Figure S4B). In the following steps, the quality of the replicate samples and subjects is evaluated only relying on the remaining high-quality peaks that have an r-value lower than the specified cutoff (upper-left plot in Figure S4C). Second, SMART calculates a proximity measure, Pearson’s correlation coefficient, for any pair of replicate samples from the same or different subjects based on the abundance data of the remaining peaks in the first step. This proximity measure is then transformed into a distance measure by subtracting Pearson’s correlation coefficient from one and dividing the difference by two. The greater the distance measure is, the greater the difference between the two replicate samples (the distance will be used in the third step). For each subject, an inconsistency index is calculated by computing the sample mean of all log-transformed pairwise distance measures from the replicate samples of the subject. The higher the inconsistency index is, the poorer the data quality of the study subject (the inconsistency will be used in the final step). Third, all subjects are categorized as “perfectly clustered,” “well clustered,” and “poorly clustered” according to a cluster analysis of pairwise distances of replicate samples as follows. A subject is perfectly clustered if all of its replicate samples are clustered together, and a subject is well clustered if at least twothirds of its replicate samples are clustered together; otherwise, the subject is poorly clustered. All replicate samples of perfectly clustered subjects are retained and all replicate samples of poorly clustered subjects are removed from the subsequent analysis (lower-left plot in Figure S4C). Finally, replicate samples of well-clustered subjects are identified as poor-quality candidates and some of them are removed as follows: the log-transformed distance h from the nonclustered replicate sample to the major group of the clustered replicate samples of the same subject is calculated using the average linkage, complete linkage, or Ward’s method in a hierarchical cluster analysis. Let mean(h), sd(h), Q3(h), and IQR(h) represent the sample mean, standard deviation, third quantile, and interquartile range of inconsistency indices, respectively, of the perfectly clustered subjects. A replicate sample is removed if the log-transformed distance h is greater than the threshold s-value (the upper-right plot in Figure S4C). If h follows a normal distribution, the s-value is expressed as mean(h) + 3 × sd(h). If h violates a normal distribution, SMART provides a nonparametric rank-based method and a bootstrap method. For the nonparametric rank-based method, C

DOI: 10.1021/acs.analchem.6b00603 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry

Figure 2. Quality control plot for sample filtering. In this plot, the vertical axis indicates the log-transformed distance h, and the horizontal axis indicates the number of steps at which all replicate samples of a subject were joined together. Each symbol indicates a subject. All subjects were classified as perfectly clustered (black), well clustered (green), or poorly clustered (red). The reference line (s-value = 0.027) was obtained using the normality-based method based on all peaks (i.e., r-value ≤ Inf). Subjects with a log-transformed distance h ≥ 0.027 are indicated by a circle, whereas the other subjects are indicated by a triangle. The subject nomenclature consists of the subject name followed by the replicate sample names. According to the sample filtering rules (see Quality Control in the Materials and Methods), all replicate samples of “black” subjects were retained and all replicate samples of “red” subjects were removed. Favorable replicate samples of well-clustered subjects (green circle) were retained and poor replicate samples of well-clustered subjects (green triangle) were removed. In this study, the sixth replicate sample of subject ACEI41 (ACEI41_6) and the fifth replicate sample of subject N7 (N7_5) were removed.

Peak Alignment and Annotation. According to the prespecified parameters of peak alignment (Table S2), we obtained 1,012 peaks. Debrisoquine (HMDB06543) was an internal standard in the experiment. After the peak alignment, debrisoquine had an m/z of 176.1188 (M+H) and an RT of 69.89 s. The m/z was very close to the monoisotopic molecular weight of debrisoquine, 175.1109 Da, obtained from HMDB (http://www.hmdb.ca/). In all replicate samples, debrisoquine can be detected after a peak detection and RT alignment procedure. The results implied that our peak alignment was accurate. The average abundance of debrisoquine among 599 replicate samples was 276.77 (sd = 86.86). In this analysis, peak annotation was performed and used to remove potential redundant peaks after peak realignment. Data Preprocessing. First, we scaled the peak abundance of every peak to the mean abundance of debrisoquine among all replicate samples. Next, debrisoquine was excluded from the peak table. We used generalized log2 transformation to transform the abundance data. Quality Control. This study focused on the RT from 20 to 195 s and m/z from 50 to 990. In this plane of the RT and m/z, 775 peaks were available for quality control. In the peak filtering procedure, no peak was excluded because of a high missing rate or zero rate. In the sample filtering procedure, when an average linkage method was used and all peaks were retained (i.e., r-value ≤ Inf), we obtained a maximum sample accuracy of 99.67% (Figure S9A); the numbers of poorly, well, and perfectly clustered subjects under this setting were 0, 3, and 97, respectively (Figure S9B). The sample accuracy was

for hypertensive patients. Fasting serum samples were collected from all hypertensive patients and then analyzed through LC− MS by using an ACQUITY-UPLC system (Waters Corp, Milford, MA, USA) and a Waters SYNAPT HDMS system (Waters Corp, Milford, MA, USA). Six technical replicate samples were used for each of the 100 patients. All of the experiments were completed in 10 days; an experiment with 60 replicate samples being collected from 10 patients per day.



RESULTS

SMART Software. The SMART software, the user guide, and an illustrative example can be downloaded from the SMART Web site (http://www.stat.sinica.edu.tw/hsinchou/ metabolomics/SMART.htm). Illustrative Example. Among the 600 replicate samples from the 50 patients in the ACEi group and 50 patients in the NonMed group, one replicate sample (the third replicate sample of patient ACEI30) was excluded by the laboratory in advance because of poor quality. Figure S7 presents the workflow and results of this antihypertensive pharmacometabolomics data analysis. Data Visualization. TIC overlay plots of six replicate samples of every patient were drawn; an example for patient ACEI1 is shown (Figure S8A). A cluster analysis based on the TIC values of 599 replicate samples of 100 patients showed that 91% of the patients were at least well clustered using the average linkage method (Figure S8B). Poor-quality replicate samples will be evaluated to remove after peak alignment. D

DOI: 10.1021/acs.analchem.6b00603 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry

ANCOVA. We set the absolute value of the fold change to >0.5 as a cutoff for biologically meaningful peaks. Of the 63 significant peaks, 5 were prioritized as biologically meaningful peaks. Table S3 lists these 5 significant molecular peaks. For peak identification, we acquired the m/z of the 5 peaks from HMDB according to the following conditions: (1) the adduct type was M + H, M + Na, or M + K and (2) molecular weight tolerance was 0.01 Da. We observed that one peak matched more than one metabolite, two peaks matched only one metabolite (Table S4), and two peaks did not match any metabolite.

calculated by dividing the sum of the total number of perfectly clustered replicate samples (581) and the number of retained well-clustered replicate samples (16) by the total number of replicate samples (599). According to the normality-based method, the s-value was 0.027. We removed the sixth replicate sample of patient ACEI41 in the ACEi group (ACEI41_6) and the fifth replicate sample of patient N7 in the NonMed group (N7_5) because their log-transformed distance h was higher than 0.027 (Figure 2). The original mzXML data of the 597 valid replicate samples were realigned on the basis of the same parameters of peak alignment (Table S2). Peak annotation was performed and redundant isotopic peaks, daughter ion fragments, and adducts except for M + H, M + Na, or M + K were removed; there were 392 peaks remained. The realignment peak abundance data was renormalized to the mean abundance of the internal standard and transformed using generalized log2 transformation. No peaks were excluded because of a high missing or zero rate. Finally, we obtained the normalized peak abundance data of 299 peaks for the 100 patients. Batch Effect Detection. Through PCA, we selected the first three PCs according to a scree plot; the three PCs explained 38.43% of the total peak abundance variation (Figure S10A). PCA plots of the normalized peak abundance data showed the pattern of the batch effect (Figure S10B). In addition, a heat map and cluster tree diagram showed a substructure of four LGs (Figure S10C). The third and fourth LGs contained patient ACEI47 and N12, respectively. These two patients were not similar to other patients, indicating that they were outliers. The two outliers were removed from the subsequent analysis by treating their data as missing values. We did not observe any similar patterns between the known covariates (age, sex, body mass index (BMI), and the month of the experiment) and the two LGs (Figure S10D). Association Analysis. To identify the peaks associated with ACEis, we used the nested ANCOVA model, where age, sex, BMI, and the batch effects, including the month of the experiment, the PC scores of the first three PCs or the two LGs, were adjusted for. To evaluate multicolinearity, we examined the VIF values of the major factor variable and covariates. The highest value was 1.74 and pertained to the association between the first PC and the month of the experiment. Because no VIF was greater than 10, all batch effect variables were adjusted for in the ANCOVA model. We considered two ANCOVA models, where age, sex, BMI, and month of the experiment were included as covariates. In the first model, we adjusted for the three PC scores; this model is called as the PCA−ANCOVA model. In total, 123 molecular peaks were nominally significant (pv < 0.05) (Figure S11A) and were further examined through a permutation test with 20 000 random shuffles (Figure S11B−C). Furthermore, 119 of the 123 peaks had an epv < 0.05, and 96 of the 119 significant peaks had an epFDR < 0.05. In the second model, we adjusted for the two LGs; this model is called as the LG−ANCOVA model. In total, 133 molecular peaks were marginally significant (pv < 0.05) (Figure S11D) and were further examined through a permutation test with 20 000 random shuffles (Figure S11E− F). Furthermore, 132 of the 133 peaks had an epv < 0.05, and 108 of the 132 significant peaks had an epFDR < 0.05. An intersection of the 96 significant peaks from the PCA− ANCOVA model and 108 significant peaks from the LG− ANCOVA model contained 63 peaks. In this study, we defined fold change as the estimated coefficient of the major factor from



DISCUSSION In addition to visualization of raw MS data, SMART provides a TIC cluster diagram for displaying the cluster patterns of the TIC for all replicate samples and subjects. This diagram provides an initial evaluation of the data. If the TIC profiles of multiple replicate samples of one subject are not clustered well, some replicate samples of this subject may have been interfered with sample contamination or instrument failure. Moreover, the TIC may be influenced by background noise or other experimental errors, and some of these noises can be adjusted in the subsequent procedures of peak alignment. Therefore, we recommend using the TIC cluster diagram for an initial examination of data quality but not for removing replicate samples at this stage. SMART uses the XCMS package11 for peak alignment. Determining the parameters of peak alignment is crucial. Several studies have discussed how to select optimal parameter settings for peak alignment.47,48 We suggest that the parameter settings be determined on the basis of the performance of the postalignment data of an internal standard. The criteria include the missing rate of abundance data, m/z accuracy, and coefficient of variation of the peak abundance of the internal standard. For example, in the antihypertensive pharmacometabolomics study, three profile methods (i.e., bin, binlin, and binlinbase) of XCMS were considered. Finally, the binlin method was used because it exhibited the highest performance: (1) all replicate samples obtained the abundance data of the internal standard; (2) the m/z of the internal standard in our data was very close to the ideal value from HMDB; and (3) the smallest coefficient of variation was achieved (Table S5). SMART users who prefer other peak alignment tools (e.g., MZmine,10 MetAlign,37 and MultiAlign12) can import the postalignment data into SMART by providing a peak abundance table (see the Materials and Methods). SMART implements several data preprocessing procedures, including abundance adjustment according to an internal standard, variance-stabilization transformation, and samplebased data normalization. Determining the optimum combination of the procedures is critical. We suggest determining the procedure combination on the basis of the sample accuracy (see Quality Control in the Results). For example, in the antihypertensive pharmacometabolomics study, the sample accuracy was 90.15% when we employed only abundance adjustment according to an internal standard. The sample accuracy increased to 99.67% (only two of the 599 replicate samples were removed) when the generalized log2 transformation was considered. However, none of the sample-based data normalization procedures in SMART can further increase the sample accuracy. Therefore, we recommend a combination of abundance adjustment according to an internal standard and variance-stabilization transformation. E

DOI: 10.1021/acs.analchem.6b00603 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry

lowering effect associated with the downregulation of neuromedin N in the ACEi group may be in part explained by the upregulation of neurotensin. Neurotensin shares significant sequence similarity in its six C-terminal amino acid residues with several other neuropeptides including neuromedin N. Neuromedin was named for its ability to cause smooth muscle cell contraction. For example, neuromedin U, the most extensively investigated neuromedin, was isolated from the spinal cord. This neuropeptide has several functions including smooth muscle contraction in the uterus and blood pressure regulation51,52 SMART can be further enhanced by adding functions. First, we are adding more functions for data preprocessing (e.g., peak-based data normalization and sample specific normalization) and association analysis (e.g., time-course analysis and logistic regression analysis). Second, several metabolite databases (e.g., HMDB,53−55 METLIN,56 and MassBank57) and tools for peak or metabolite identification (e.g., XCMS Online,11,18,19 MIDAS,58 and Metabolome searcher59) have been developed. We are developing interactive modules for providing a convenient link between the identified peaks and metabolites and the public databases, facilitating biological explanations. Finally, pathway analysis facilitates understanding biological mechanisms.60,61 We are developing new metabolite set enrichment methods for pathway analysis, and these new functions will be added to SMART.

SMART provides quality evaluation for peak features by using the r-value and cluster analysis, and provides quality control methods (i.e., methods for calculating the s-value) for removing poor-quality replicate samples and subjects. The cutoff of the r-value and linkage methods (i.e., average, complete, and Ward’s linkage) in the cluster analysis should be determined. Moreover, three methods should be evaluated for calculating the threshold of the s-value (i.e., normalitybased, nonparametric rank-based, and bootstrap methods) (see Quality Control in the Materials and Methods). Similar to data preprocessing, we suggest determining the optimum threshold or methods on the basis of the maximum sample accuracy. For example, in the antihypertensive pharmacometabolomics study, the average linkage method combined with a cutoff of the rvalue of Inf (i.e., no peaks were removed) (Figure S9A) and the normality-based method for the s-value (Figure 2) were used, and this combination yielded superior results; 0%, 3%, and 97% of the subjects were poorly clustered (red), well clustered (green), and perfectly clustered (blue), respectively (Figure S9B). The sample accuracy was 99.67%; only two replicate samples (ACEI41_6 and N7_5; the green triangle in Figure 2) were removed. The TIC plots of these two replicate samples indeed differed from other replicate samples of the same subject. As shown in Figure 2, ACEI44_5 and ACEI44_6 were well clustered replicate samples (green circle) but not clustered with the major group of replicate samples of the subject ACEI44. They were not removed from the subsequent analysis because their distances did not exceed the cutoff of the s-value (i.e., 0.027). Two other replicate samples, ACEI45_5 and ACEI45_6 (black triangle) (i.e., ACEI45_5_6 in Figure 2), were not removed because they were perfectly clustered. After removal of the poor-quality replicate samples, realignment is necessary. SMART provides a convenient realignment function for postquality-control data (see Quality Control in the Results). SMART adjusts for batch effects by modeling the known experimental conditions, unknown LGs, and hidden substructures. Avoiding multicollinearity in an ANCOVA model is crucial because it may cause numerical difficulties during parameter estimation and hypothesis testing in the association analysis. Moreover, multicollinearity among adjustment factors may cause overadjustment in an ANCOVA model. We suggest using the VIF to evaluate multicollinearity. In the antihypertensive pharmacometabolomics study, the VIFs of the month effect, three PCs, and two LGs were less than the default value, 10, implying that there was no multicollinearity. Our MWAS identified five metabolites, annotated by HMDB, significantly associated with ACEi antihypertensive medications. One of the metabolites, neuromedin N (RT = 66.9344 and m/z = 656.3392), is an endogenous metabolite and had a lower metabolite abundance in the ACEi group than in the NonMed group; the fold changes were −0.8765 and −0.7320 in the LG−ANCOVA and PCA−ANCOVA models, respectively (Table S3). Neuromedin N is an endogenous neuropeptide derived from the same precursor polypeptide as neurotensin.49 Neurtensin, a 13-amino-acid peptide originally isolated from the bovine hypothalamus,50 can produce pronounced vasodilation in the exposed cutaneous regions of anesthetized rats.50 Notably, neurotensin can be enhanced using ACEis.49 Our data showed that a peak of neurotenin (HMDB13024) negatively associated with the peak of neuromedin N (HMDB13022); Pearson’s correlation of the two peaks was −0.178. We hypothesize that the blood pressure



CONCLUSIONS We developed the user-friendly software SMART under GPL_v2 license for integrated analysis of metabolomics data. SMART can process input files with various formats, visually represent various types of data features, implement peak alignment, conduct quality control, explore batch effects, and perform association analysis for MWASs.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.6b00603. Analysis functions in SMART and a comparison to existing tools; parameter settings during peak detection and RT alignment; five significant peaks associated with ACEis; two significant peaks associated with ACEis and matched only one metabolite from HMDB; summary statistics of the peak abundance of the internal standard; data visualization, interface of peak alignment and annotation, data preprocessing interface, quality control interfaces, batch effect detection interface, and association analysis interface in SMART, workflow and results of the antihypertensive pharmacometabolomics data analysis; and data visualization of TIC, quality control of the data, batch effect detection of the data, and association analysis in the pharmacometabolomics study (PDF)



AUTHOR INFORMATION

Corresponding Author

*Tel: 886-2-27835611 ext. 113. Fax: 886-2-27831523. E-mail: [email protected]. Author Contributions +

F

Equal contribution DOI: 10.1021/acs.analchem.6b00603 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry Notes

(25) Fernie, A. R.; Aharoni, A.; Willmitzer, L.; Stitt, M.; Tohge, T.; Kopka, J.; Carroll, A. J.; Saito, K.; Fraser, P. D.; DeLuca, V. Plant Cell 2011, 23, 2477−2482. (26) Want, E. J.; Wilson, I. D.; Gika, H.; Theodoridis, G.; Plumb, R. S.; Shockcor, J.; Holmes, E.; Nicholson, J. K. Nat. Protoc. 2010, 5, 1005−1018. (27) Kamleh, M. A.; Ebbels, T. M.; Spagou, K.; Masson, P.; Want, E. J. Anal. Chem. 2012, 84, 2670−2677. (28) De Livera, A. M.; Dias, D. A.; De Souza, D.; Rupasinghe, T.; Pyke, J.; Tull, D.; Roessner, U.; McConville, M.; Speed, T. P. Anal. Chem. 2012, 84, 10768−10776. (29) Dunn, W. B.; Broadhurst, D.; Begley, P.; Zelena, E.; FrancisMcIntyre, S.; Anderson, N.; Brown, M.; Knowles, J. D.; Halsall, A.; Haselden, J. N.; Nicholls, A. W.; Wilson, I. D.; Kell, D. B.; Goodacre, R. Nat. Protoc. 2011, 6, 1060−1083. (30) Karpievitch, Y. V.; Nikolic, S. B.; Wilson, R.; Sharman, J. E.; Edwards, L. M. PLoS One 2014, 9, e116221. (31) Leek, J. T.; Scharpf, R. B.; Bravo, H. C.; Simcha, D.; Langmead, B.; Johnson, W. E.; Geman, D.; Baggerly, K.; Irizarry, R. A. Nat. Rev. Genet. 2010, 11, 733−739. (32) Gregori, J.; Villarreal, L.; Mendez, O.; Sanchez, A.; Baselga, J.; Villanueva, J. J. Proteomics 2012, 75, 3938−3951. (33) Soneson, C.; Gerster, S.; Delorenzi, M. PLoS One 2014, 9, e100335. (34) Pedrioli, P. G.; Eng, J. K.; Hubley, R.; Vogelzang, M.; Deutsch, E. W.; Raught, B.; Pratt, B.; Nilsson, E.; Angeletti, R. H.; Apweiler, R.; Cheung, K.; Costello, C. E.; Hermjakob, H.; Huang, S.; Julian, R. K.; Kapp, E.; McComb, M. E.; Oliver, S. G.; Omenn, G.; Paton, N. W.; Simpson, R.; Smith, R.; Taylor, C. F.; Zhu, W.; Aebersold, R. Nat. Biotechnol. 2004, 22, 1459−1466. (35) Chambers, M. C.; Maclean, B.; Burke, R.; Amodei, D.; Ruderman, D. L.; Neumann, S.; Gatto, L.; Fischer, B.; Pratt, B.; Egertson, J.; Hoff, K.; Kessner, D.; Tasman, N.; Shulman, N.; Frewen, B.; Baker, T. A.; Brusniak, M. Y.; Paulse, C.; Creasy, D.; Flashner, L.; Kani, K.; Moulding, C.; Seymour, S. L.; Nuwaysir, L. M.; Lefebvre, B.; Kuhlmann, F.; Roark, J.; Rainer, P.; Detlev, S.; Hemenway, T.; Huhmer, A.; Langridge, J.; Connolly, B.; Chadick, T.; Holly, K.; Eckels, J.; Deutsch, E. W.; Moritz, R. L.; Katz, J. E.; Agus, D. B.; MacCoss, M.; Tabb, D. L.; Mallick, P. Nat. Biotechnol. 2012, 30, 918− 920. (36) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. Bioinformatics 2008, 24, 2534−2536. (37) Lommen, A. Anal. Chem. 2009, 81, 3079−3086. (38) Tautenhahn, R.; Bottcher, C.; Neumann, S. BMC Bioinf. 2008, 9, 504. (39) Kuhl, C.; Tautenhahn, R.; Bottcher, C.; Larson, T. R.; Neumann, S. Anal. Chem. 2012, 84, 283−289. (40) Durbin, B. P.; Hardin, J. S.; Hawkins, D. M.; Rocke, D. M. Bioinformatics 2002, 18, S105−S110. (41) Quackenbush, J. Nat. Genet. 2002, 32 (Suppl), 496−501. (42) DiCiccio, T. J.; Efron, B. Statistical Science 1996, 11, 189−212. (43) Benjamini, Y.; Hochberg, Y. J. R. Stat. Soc. Ser. B 1995, 57, 289− 300. (44) Fox, J.; Monette, G. J. Am. Stat. Assoc. 1992, 87, 178−183. (45) O’brien, R. M. Quality & Quantity 2007, 41, 673−690. (46) Yang, H. C.; Liang, Y. J.; Wu, Y. L.; Chung, C. M.; Chiang, K. M.; Ho, H. Y.; Ting, C. T.; Lin, T. H.; Sheu, S. H.; Tsai, W. C.; Chen, J. H.; Leu, H. B.; Yin, W. H.; Chiu, T. Y.; Chen, C. I.; Fann, C. S.; Wu, J. Y.; Lin, T. N.; Lin, S. J.; Chen, Y. T.; Chen, J. W.; Pan, W. H. PLoS One 2009, 4, e5459. (47) Libiseller, G.; Dvorzak, M.; Kleb, U.; Gander, E.; Eisenberg, T.; Madeo, F.; Neumann, S.; Trausinger, G.; Sinner, F.; Pieber, T.; Magnes, C. BMC Bioinf. 2015, 16, 118. (48) Uppal, K.; Soltow, Q. A.; Strobel, F. H.; Pittard, W. S.; Gernert, K. M.; Yu, T. W.; Jones, D. P. BMC Bioinf. 2013, 14, 15. (49) Binder, E. B.; Kinkead, B.; Owens, M. J.; Nemeroff, C. B. Pharmacol. Rev. 2001, 53, 453−486. (50) Carraway, R.; Leeman, S. E. J. Biol. Chem. 1973, 248, 6854− 6861.

This paper is based on the PhD dissertation of the first author (Y.J.L.) completed under the supervision of H.C.Y. and K.M.C. The authors declare no competing financial interest.



ACKNOWLEDGMENTS This study was supported by the Ministry of Science and Technology of Taiwan (grant number MOST 103-2314-B-001008-MY3 to H.C.Y).



REFERENCES

(1) Hird, S. J.; Lau, B. P. Y.; Schuhmacher, R.; Krska, R. TrAC, Trends Anal. Chem. 2014, 59, 59−72. (2) Roede, J. R.; Uppal, K.; Park, Y.; Lee, K.; Tran, V.; Walker, D.; Strobel, F. H.; Rhodes, S. L.; Ritz, B.; Jones, D. P. PLoS One 2013, 8, e77629. (3) Liu, X.; Jia, L. Curr. Drug Metab. 2007, 8, 815−821. (4) Patti, G. J.; Yanes, O.; Siuzdak, G. Nat. Rev. Mol. Cell Biol. 2012, 13, 263−269. (5) Clarke, C. J.; Haselden, J. N. Toxicol. Pathol. 2008, 36, 140−147. (6) Gibney, M. J.; Walsh, M.; Brennan, L.; Roche, H. M.; German, B.; van Ommen, B. Am. J. Clin. Nutr. 2005, 82, 497−503. (7) Altuntas, E.; Schubert, U. S. Anal. Chim. Acta 2014, 808, 56−69. (8) Donia, M. S.; Fischbach, M. A. Science 2015, 349, 1254766. (9) Senyilmaz, D.; Virtue, S.; Xu, X. J.; Tan, C. Y.; Griffin, J. L.; Miller, A. K.; Vidal-Puig, A.; Teleman, A. A. Nature 2015, 525, 124− 128. (10) Katajamaa, M.; Miettinen, J.; Oresic, M. Bioinformatics 2006, 22, 634−636. (11) Smith, C. A.; Want, E. J.; O’Maille, G.; Abagyan, R.; Siuzdak, G. Anal. Chem. 2006, 78, 779−787. (12) LaMarche, B. L.; Crowell, K. L.; Jaitly, N.; Petyuk, V. A.; Shah, A. R.; Polpitiya, A. D.; Sandoval, J. D.; Kiebel, G. R.; Monroe, M. E.; Callister, S. J.; Metz, T. O.; Anderson, G. A.; Smith, R. D. BMC Bioinf. 2013, 14, 49. (13) Xia, J. G.; Mandal, R.; Sinelnikov, I. V.; Broadhurst, D.; Wishart, D. S. Nucleic Acids Res. 2012, 40, W127−W133. (14) Xia, J. G.; Psychogios, N.; Young, N.; Wishart, D. S. Nucleic Acids Res. 2009, 37, W652−W660. (15) Xia, J. G.; Sinelnikov, I. V.; Han, B.; Wishart, D. S. Nucleic Acids Res. 2015, 43, W251−W257. (16) Kessler, N.; Neuweger, H.; Bonte, A.; Langenkamper, G.; Niehaus, K.; Nattkemper, T. W.; Goesmann, A. Bioinformatics 2013, 29, 2452−2459. (17) Neuweger, H.; Albaum, S. P.; Dondrup, M.; Persicke, M.; Watt, T.; Niehaus, K.; Stoye, J.; Goesmann, A. Bioinformatics 2008, 24, 2726−2732. (18) Gowda, H.; Ivanisevic, J.; Johnson, C. H.; Kurczy, M. E.; Benton, H. P.; Rinehart, D.; Nguyen, T.; Ray, J.; Kuehl, J.; Arevalo, B.; Westenskow, P. D.; Wang, J. H.; Arkin, A. P.; Deutschbauer, A. M.; Patti, G. J.; Siuzdak, G. Anal. Chem. 2014, 86, 6931−6939. (19) Tautenhahn, R.; Patti, G. J.; Rinehart, D.; Siuzdak, G. Anal. Chem. 2012, 84, 5035−5039. (20) Sekula, P.; Goek, O. N.; Quaye, L.; Barrios, C.; Levey, A. S.; Romisch-Margl, W.; Menni, C.; Yet, I.; Gieger, C.; Inker, L. A.; Adamski, J.; Gronwald, W.; Illig, T.; Dettmer, K.; Krumsiek, J.; Oefner, P. J.; Valdes, A. M.; Meisinger, C.; Coresh, J.; Spector, T. D.; Mohney, R. P.; Suhre, K.; Kastenmuller, G.; Kottgen, A. J. Am. Soc. Nephrol. 2016, 27, 1175−1188. (21) Tzoulaki, I.; Ebbels, T. M.; Valdes, A.; Elliott, P.; Ioannidis, J. P. Am. J. Epidemiol. 2014, 180, 129−139. (22) Altmaier, E.; Fobo, G.; Heier, M.; Thorand, B.; Meisinger, C.; Romisch-Margl, W.; Waldenberger, M.; Gieger, C.; Illig, T.; Adamski, J.; Suhre, K.; Kastenmuller, G. Eur. J. Epidemiol. 2014, 29, 325−336. (23) Osborn, M. P.; Park, Y.; Parks, M. B.; Burgess, L. G.; Uppal, K.; Lee, K.; Jones, D. P.; Brantley, M. A., Jr. PLoS One 2013, 8, e72737. (24) Zamora-Ros, R.; Touillaud, M.; Rothwell, J. A.; Romieu, I.; Scalbert, A. Am. J. Clin. Nutr. 2014, 100, 11−26. G

DOI: 10.1021/acs.analchem.6b00603 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry (51) Rahman, A. A.; Shahid, I. Z.; Pilowsky, P. M. Peptides 2013, 44, 15−24. (52) Brighton, P. J.; Szekeres, P. G.; Willars, G. B. Pharmacol. Rev. 2004, 56, 231−248. (53) Wishart, D. S.; Jewison, T.; Guo, A. C.; Wilson, M.; Knox, C.; Liu, Y.; Djoumbou, Y.; Mandal, R.; Aziat, F.; Dong, E.; Bouatra, S.; Sinelnikov, I.; Arndt, D.; Xia, J.; Liu, P.; Yallou, F.; Bjorndahl, T.; Perez-Pineiro, R.; Eisner, R.; Allen, F.; Neveu, V.; Greiner, R.; Scalbert, A. Nucleic Acids Res. 2013, 41, D801−807. (54) Wishart, D. S.; Knox, C.; Guo, A. C.; Eisner, R.; Young, N.; Gautam, B.; Hau, D. D.; Psychogios, N.; Dong, E.; Bouatra, S.; Mandal, R.; Sinelnikov, I.; Xia, J.; Jia, L.; Cruz, J. A.; Lim, E.; Sobsey, C. A.; Shrivastava, S.; Huang, P.; Liu, P.; Fang, L.; Peng, J.; Fradette, R.; Cheng, D.; Tzur, D.; Clements, M.; Lewis, A.; De Souza, A.; Zuniga, A.; Dawe, M.; Xiong, Y.; Clive, D.; Greiner, R.; Nazyrova, A.; Shaykhutdinov, R.; Li, L.; Vogel, H. J.; Forsythe, I. Nucleic Acids Res. 2009, 37, D603−610. (55) Wishart, D. S.; Tzur, D.; Knox, C.; Eisner, R.; Guo, A. C.; Young, N.; Cheng, D.; Jewell, K.; Arndt, D.; Sawhney, S.; Fung, C.; Nikolai, L.; Lewis, M.; Coutouly, M. A.; Forsythe, I.; Tang, P.; Shrivastava, S.; Jeroncic, K.; Stothard, P.; Amegbey, G.; Block, D.; Hau, D. D.; Wagner, J.; Miniaci, J.; Clements, M.; Gebremedhin, M.; Guo, N.; Zhang, Y.; Duggan, G. E.; Macinnis, G. D.; Weljie, A. M.; Dowlatabadi, R.; Bamforth, F.; Clive, D.; Greiner, R.; Li, L.; Marrie, T.; Sykes, B. D.; Vogel, H. J.; Querengesser, L. Nucleic Acids Res. 2007, 35, D521−526. (56) Smith, C. A.; O’Maille, G.; Want, E. J.; Qin, C.; Trauger, S. A.; Brandon, T. R.; Custodio, D. E.; Abagyan, R.; Siuzdak, G. Ther. Drug Monit. 2005, 27, 747−751. (57) Horai, H.; Arita, M.; Kanaya, S.; Nihei, Y.; Ikeda, T.; Suwa, K.; Ojima, Y.; Tanaka, K.; Tanaka, S.; Aoshima, K.; Oda, Y.; Kakazu, Y.; Kusano, M.; Tohge, T.; Matsuda, F.; Sawada, Y.; Hirai, M. Y.; Nakanishi, H.; Ikeda, K.; Akimoto, N.; Maoka, T.; Takahashi, H.; Ara, T.; Sakurai, N.; Suzuki, H.; Shibata, D.; Neumann, S.; Iida, T.; Tanaka, K.; Funatsu, K.; Matsuura, F.; Soga, T.; Taguchi, R.; Saito, K.; Nishioka, T. J. Mass Spectrom. 2010, 45, 703−714. (58) Wang, Y. F.; Kora, G.; Bowen, B. P.; Pan, C. L. Anal. Chem. 2014, 86, 9496−9503. (59) Dhanasekaran, A. R.; Pearson, J. L.; Ganesan, B.; Weimer, B. C. BMC Bioinf. 2015, 16, 62. (60) Xia, J. G.; Wishart, D. S. Nucleic Acids Res. 2010, 38, W71−W77. (61) Xia, J. G.; Wishart, D. S. Bioinformatics 2010, 26, 2342−2344.

H

DOI: 10.1021/acs.analchem.6b00603 Anal. Chem. XXXX, XXX, XXX−XXX