Article pubs.acs.org/ac

Integrated Strategy for Unknown EI−MS Identification Using Quality Control Calibration Curve, Multivariate Analysis, EI−MS Spectral Database, and Retention Index Prediction

Teruko Matsuo,†,# Hiroshi Tsugawa,*,†,‡,§,# Hiromi Miyagawa,∥ and Eiichiro Fukusaki*,†

† Department of Biotechnology, Graduate School of Engineering, Osaka University, 2-1 Yamadaoka, Suita, Osaka 565-0871, Japan
‡ RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
§ RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
∥ GL Sciences Inc., Iruma, Saitama 358-0032, Japan



Supporting Information

ABSTRACT: Compound identification using unknown electron ionization (EI) mass spectra in gas chromatography coupled with mass spectrometry (GC−MS) is challenging in untargeted metabolomics, natural product chemistry, or exposome research. While the total count of EI−MS records included in publicly or commercially available databases is over 900 000, efficient use of this huge database has not been achieved in metabolomics. Therefore, we proposed a “four-step” strategy for the identification of biologically significant metabolites using an integrated cheminformatics approach: (i) quality control calibration curve to reduce background noise, (ii) variable selection by hypothesis testing in principal component analysis for the efficient selection of target peaks, (iii) searching the EI−MS spectral database, and (iv) retention index (RI) filtering in combination with RI predictions. In this study, the new MS-FINDER spectral search engine was developed and utilized for searching EI−MS databases using mass spectral similarity with the evaluation of false discovery rate. Moreover, in silico derivatization software, MetaboloDerivatizer, was developed to calculate the chemical properties of derivative compounds, and all retention indexes in EI−MS databases were predicted using a simple mathematical model. The strategy was showcased in the identification of three novel metabolites (butane-1,2,3-triol, 3-deoxyglucosone, and palatinitol) in Chinese medicine Senkyu for quality assessment, as validated using authentic standard compounds. All tools and curated public EI−MS databases are freely available in the ‘Computational MS-based metabolomics’ section of the RIKEN PRIMe Web site (http://prime.psc.riken.jp).

Received: March 25, 2017 Accepted: May 18, 2017 Published: May 18, 2017
DOI: 10.1021/acs.analchem.7b01010 Anal. Chem. 2017, 89, 6766−6773
© 2017 American Chemical Society

Mass spectrometry (MS)-oriented untargeted metabolomics simultaneously provides ion abundance information for several hundred metabolites, including both identified and unidentified metabolites, in biological samples. This information, known as a “metabolome table”, is used for further statistical analyses in various medical or biological sciences.1−3 Gas chromatography coupled with electron ionization (EI) mass spectrometry (GC−MS) is a popular technique for untargeted metabolomics.4,5 The workflow of GC−MS-based metabolomics allows 500−1000 chromatographic peaks to be obtained in a single run with highly stable retention index (RI) and EI−MS spectral information. However, of these, only 100−200 small biomolecules, representing 20% of the unique chromatographic peaks, can be identified based on similarity matching with EI spectra and retention indices (RIs) of reference compounds. Unfortunately, there is no perfect solution for identifying all chromatographic peaks, although studies tackling this issue have been reported, as described below. Therefore, novel and more efficient strategies for the structural elucidation of unknown EI−MS spectra are required in metabolomics.

A basic approach for annotating unknown metabolites uses the large EI spectral databases, including MassBank6 and the commercially available NIST and Wiley databases. Currently, there are 15 302 experimental spectra records of 9 003 unique structures in the MassBank of North America (MoNA), 276 259 experimental spectra records of 242 477 unique structures in NIST 14, and 719 456 experimental spectra records of 583 059 unique structures in the Wiley 10th spectra database. So far, small molecules have been successfully identified in many studies by combining NIST (plus Wiley and MassBank) with AMDIS software, which has frequently been used for GC−MS spectral deconvolution followed by spectral searching with a certain RI tolerance.7 Additionally, the prediction of EI spectra has recently been reported to expand their capability.8 Databases and related programs could assist greatly in annotating unknown EI spectra. However, searching spectral databases or using in silico spectral prediction often provides many false positive candidates, and analyzing authentic standards for all candidates is unrealistic.

A great advantage of GC−MS compared to LC−MS is the reusability of robust and accurate retention markers, i.e., RIs, which have been recorded for decades using well-known systems including the Kovats,9 Lee,10 and Fiehn11 indices; conversion among them is also available, as described by Stephen et al.7 or in the MoNA database. Experimental RI values have been used effectively for the reliable identification of biomolecules, enabling the construction of highly scalable GC−MS metabolome databases such as BinBase12 and Golm DB.13 However, RI information is not always available, and therefore a methodology for RI prediction is required for the comprehensive use of experimental records, as several groups have reported.14,15 Furthermore, narrowing down the unknown candidates in an analytical and statistical manner is also important before any further annotation process. For example, quality control (QC) samples, a mixture of small aliquots from each biological sample where all molecules in each sample can be detected, have been utilized to filter out chromatographic noise.16 Additionally, a methodology for variable selection in multivariate analysis has been developed to objectively obtain biologically important metabolites using hypothesis testing.17 Therefore, an integrated cheminformatics approach using QC samples, statistical analysis, spectral searches, and retention index prediction needs to be developed for efficient use of EI−MS databases to identify biologically significant metabolites.

In this study, we propose a four-step strategy for compound identification of unknown GC−MS peaks (Figure 1). First, noise chromatographic peaks are removed using a calibration curve from a dilution series of a pooled QC sample (QC curve filtering), which gives a curated metabolome table. The basic concept of this QC curve filtering is that the ion abundance of biologically meaningful chromatographic peaks should change in step with the dilution series of a pooled QC sample, while background noise peaks from derivatization reagents, column bleed, and other reactants should be unrelated to the QC curve. Second, variable selection in principal component analysis (PCA) is objectively performed by hypothesis testing. Third, the EI spectra database is searched for unknown candidates using the newly developed MS-FINDER spectral search engine, in which the publicly available GC−MS records are implemented (15 302 records at MS-FINDER version 2.12); the Wiley 10th database was also imported independently as a “user-defined library” in this study. Finally, we provide a retention index prediction approach using a quantitative structure-retention relationship (QSRR) method in combination with the newly developed in silico derivatization software MetaboloDerivatizer to filter out the remaining false positive candidates. Herein, we demonstrate this strategy using 64 GC−MS data files for the Chinese medicine Senkyu, the dried root of Cnidium officinale Makino, which is used in Japan as a herbal medicine for menopause, poor circulation, drainage, and skin disease. Our approach discovered three new biomolecules in Senkyu, which will give novel insights into its medicinal effects and can serve as quality markers for judging biological origin.
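The QC curve criterion can be illustrated with a short Python sketch. It assumes that one aligned feature is given as replicate ion abundances for each QC dilution level, ordered from the lowest to the highest amount, and it applies the two exclusion rules reported in Results and Discussion (no continual increase across levels, or an average relative standard deviation above 20). The function name, the data layout, and the reading of the threshold as a percentage are illustrative assumptions, not part of MS-DIAL.

```python
from statistics import mean, stdev

def passes_qc_curve(levels, max_mean_rsd=20.0):
    """Return True if a feature behaves like a sample-derived metabolite.

    levels: list of replicate-intensity lists, ordered from the most dilute
    QC level to the most concentrated one (hypothetical layout).
    """
    means = [mean(reps) for reps in levels]
    # Rule 1: the mean abundance must rise at every step of the series.
    increasing = all(lo < hi for lo, hi in zip(means, means[1:]))
    # Rule 2: the relative standard deviation (%), averaged over the
    # levels, must not exceed the threshold.
    rsds = [100.0 * stdev(reps) / m
            for reps, m in zip(levels, means)
            if m > 0 and len(reps) > 1]
    mean_rsd = mean(rsds) if rsds else float("inf")
    return increasing and mean_rsd <= max_mean_rsd

# Invented numbers for illustration only.
metabolite = [[100, 104, 98, 102], [205, 198, 201, 199],
              [300, 310, 295, 305], [410, 400, 395, 405]]
artifact = [[500, 510, 495, 505], [498, 503, 507, 499],
            [502, 497, 505, 500], [501, 499, 504, 498]]
print(passes_qc_curve(metabolite), passes_qc_curve(artifact))
```

Which dilution levels enter the check is left to the caller, since the text specifies the rule only as a continual increase over the concentration ranges.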

Figure 1. Four-step strategy for the identification of unknown EI−MS spectra. Step 1: After the raw MS data set was processed, the metabolome table was curated by QC curve filter, which removed the procedure artifacts or chromatographic noises. Step 2: Hypothesis testing in principal component analysis was used to obtain biologically meaningful chromatographic peaks. Step 3: EI−MS database-oriented structure elucidation was performed based on spectral similarity matching. Step 4: After retention index predictions were generated by multiple regression analysis, most false positive candidates were excluded by RI filtering. Finally, commercially available and/or synthesized compounds were analyzed to validate metabolite annotations.
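Step 2 rests on the test described under Results and Discussion: a loading is the correlation coefficient between a principal component score and a variable, and its significance is judged against a t-distribution with n − 2 degrees of freedom, with Bonferroni correction (Figure 3). The sketch below is one possible implementation using only the Python standard library. It computes the two-sided p-value by integrating the null density of the correlation coefficient, which gives the same result as the t test. The PC scores are assumed to be computed elsewhere, and all function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def p_value_from_r(r, n, steps=2000):
    """Two-sided p-value for H0: correlation = 0 (equivalent to t, df = n - 2)."""
    if n < 3:
        raise ValueError("need at least 3 samples")
    # Substituting r = cos(theta) turns the tail integral of the null
    # density of r into the integral of sin(theta)**(n - 3).
    upper = math.acos(min(1.0, abs(r)))
    k = n - 3
    h = upper / steps  # steps must be even for Simpson's rule
    total = math.sin(0.0) ** k + math.sin(upper) ** k
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.sin(i * h) ** k
    integral = total * h / 3.0
    log_beta = (math.lgamma(0.5) + math.lgamma((n - 2) / 2)
                - math.lgamma((n - 1) / 2))
    return min(1.0, 2.0 * integral / math.exp(log_beta))

def significant_features(scores, table, alpha=0.05):
    """table: dict feature name -> values across samples (same order as scores)."""
    cutoff = alpha / len(table)  # Bonferroni correction
    n = len(scores)
    return [name for name, values in table.items()
            if p_value_from_r(pearson(scores, values), n) < cutoff]
```

For example, `p_value_from_r(0.5, 10)` evaluates to about 0.141, which matches the two-sided t test with 8 degrees of freedom.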



EXPERIMENTAL SECTION

Samples and Experimental Procedures. Six types of Chinese medicine Senkyu were supplied by Tochimoto Tenkaido Co. Ltd. (Osaka, Japan), and their origins were defined by species (Cnidium officinale Makino or Ligusticum chuanxiong Hort.), cultivation area (Japan or China), and manufacturing process (with or without steaming), as summarized in Supporting Information Table S1. A 10 mg sample of each dried root was weighed and homogenized in a 2 mL Eppendorf tube. For quality control samples, 10 mg of each sample was added into a 15 mL tube (total 360 mg; 6 biological samples and 6 biological replicates) and then divided into six 2 mL Eppendorf tubes such that they contained 10, 20, 30, 40, 50, and 60 mg, respectively. An empty tube was also prepared, labeled as 0 mg QC, for monitoring background effects. Metabolite extraction, derivatization, and GC−MS analysis followed reported methods (Supporting Information).18,19

Data Processing. NetCDF format files exported from GCMSsolutions (Shimadzu Co., Kyoto, Japan) were converted to Analysis Base Framework (ABF) format files using a free ABF file converter (http://www.reifycs.com/AbfConverter/index.html). MS-DIAL (version 2.48) software was downloaded from the RIKEN PRIMe Web site and used for data processing of the GC−MS data set. The parameters were set as follows: smoothing level, 3; minimum peak height, 2000; average peak width, 20; default parameters were used for the others. The MSP format file (EI−MS reference library) was created using our in-house database and can also be downloaded from the RIKEN PRIMe Web site (entitled Osaka Univ. DB). Note that the ranking of structure candidates was based on mass spectral similarity, which was the total score of dot product, reverse dot product, and existence percentage of fragment ions (weighted 2:2:1, respectively) in combination with RI similarity. Details of their mathematical functions followed our previous report.2 After automatic data processing was finished, the identification results were manually curated with the MS-DIAL graphical user interface by a GC−MS expert, and false positive identification results were changed to “unknown”. A total of 1975 chromatographic peaks (labeled as aligned spots) were created, comprising 127 identified and 1848 unknown peaks (Supporting Information Table S2). All GC−MS data files can be downloaded from the RIKEN Dropmet Web site (http://prime.psc.riken.jp/?action=drop_index).

MS-FINDER Spectral Search Engine for Using Mass Spectral Databases. The spectral search engine was implemented in MS-FINDER software. EI−MS spectral records from MassBank and MoNA were implemented internally as the default database for the search engine. The Wiley 10th mass spectral records were prepared in MSP format and were also imported into MS-FINDER as a “user-defined MSP database”. Note that we excluded the Wiley spectral records for which the SMILES code was not recognized by ChemAxon molconverter (https://www.chemaxon.com/): a total of 668 231 spectral records was used in this study. The NIST14 EI−MS spectra were not utilized because they could not be converted to ASCII format by the Lib2NIST converter for use in MS-FINDER. SMILES and InChIKey for all internal records were generated from the InChI code, CAS number, or (sometimes) chemical name information using ChemAxon molconverter for the InChI code or the Chemical Translation Service20 for the CAS number and chemical name. Then, all of the molecular structures were computationally derivatized, as shown in the next section, for their RI predictions. Structures were ranked by spectral similarity in the same way as in MS-DIAL (see above). Note that only the experimental EI−MS spectra were utilized in this study, and the publicly distributable spectral records, i.e., MassBank and MoNA, are available at the RIKEN PRIMe Web site.

In Silico Derivatization for MeOX and TMS Reaction of Small Molecules. The in silico derivatization software, MetaboloDerivatizer, developed in the C# language, converted the target biomolecules into their methoxy (MeOX) and trimethylsilylated (TMS) forms. The Chemistry Development Kit21 was used to recognize the SMILES code of a target molecule as a list of atom and bond connectivity and to export the reacted SMILES code from that list. For the MeOX reaction, the carbonyl groups (a carbon−oxygen double bond) in ketone or aldehyde moieties were recognized as reactive C=O and converted to C=N−OCH3. For the TMS reaction, all acidic protons in OH, COOH, NRH, and SH moieties were converted to O−TMS, COO−TMS, NR−TMS, and S−TMS, respectively. In contrast, the reaction of primary amines was optional: users could select the degree of conversion as −N(TMS)2, −NH−TMS, or −NH2, with reaction with one TMS, i.e., −NH−TMS, used as the default setting. The derivative form could be exported as the SMILES code, and batch conversion was also available.

Retention Index Prediction. Our in-house spectral library was used to construct the retention index (RI) regression model. The derivative forms of compound structures were automatically created by MetaboloDerivatizer. For metabolites containing primary amines, all possible TMS forms, as described above, were prepared. The determination of derivative isomers arising from the reaction of primary amines, such as Valine 1TMS, 2TMS, and 3TMS, and from the cis/trans configuration of the MeOX reaction, was performed manually by GC−MS experts with literature and database mining and substantial manual analysis. While our library contained 430 spectral records, we used the 337 spectral records for which we were confident of the derivative forms (also see Results and Discussion). Molecular descriptors of the derivative SMILES code were computed by PaDEL-descriptor version 2.18.22 Furthermore, the correlation coefficient between the RI value and each descriptor value was calculated, and descriptors with correlation coefficient values smaller than 0.8 were excluded (Supporting Information Table S3). A multiple regression model was constructed using the statistical language R version 3.2.2, and the “forward-step” function, which applies the Akaike Information Criterion (AIC) in combination with a cross-validation method for model selection, was used to determine important variables (see Results and Discussion).
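The candidate ranking score described under Data Processing (dot product, reverse dot product, and presence of fragment ions, weighted 2:2:1) can be sketched as follows. The exact mathematical functions are given in the authors' previous report and are not reproduced here, so the three component definitions below are assumptions: cosine similarity over all nominal m/z values, cosine similarity restricted to the reference peaks, and the fraction of reference peaks found in the query. The RI similarity term is omitted.

```python
import math

def _cosine(a, b, keys):
    """Cosine similarity of two spectra restricted to the given m/z keys."""
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    na = math.sqrt(sum(a.get(k, 0.0) ** 2 for k in keys))
    nb = math.sqrt(sum(b.get(k, 0.0) ** 2 for k in keys))
    return dot / (na * nb) if na and nb else 0.0

def spectral_similarity(query, reference, weights=(2.0, 2.0, 1.0)):
    """query, reference: dict nominal m/z -> intensity (hypothetical layout)."""
    ref_mz = set(reference)
    all_mz = set(query) | ref_mz
    dot = _cosine(query, reference, all_mz)      # penalizes extra query peaks
    reverse = _cosine(query, reference, ref_mz)  # ignores extra query peaks
    if ref_mz:
        presence = sum(1 for mz in ref_mz if query.get(mz, 0.0) > 0) / len(ref_mz)
    else:
        presence = 0.0
    w_dot, w_rev, w_pres = weights
    total = w_dot * dot + w_rev * reverse + w_pres * presence
    return total / (w_dot + w_rev + w_pres)

# Invented spectra: the query carries one peak absent from the reference.
reference = {73: 100.0, 147: 50.0}
query = {73: 100.0, 147: 50.0, 207: 80.0}
print(round(spectral_similarity(query, reference), 3))
```

Dividing by the sum of the weights keeps the score between 0 and 1, so that a percentage cutoff such as the 80% used later can be applied directly.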



RESULTS AND DISCUSSION

The four-step strategy reported herein used 64 GC−MS data files of Chinese medicine Senkyu (six different origins, n = 6; seven concentration ranges of QC samples, n = 4), as described in Supporting Information Table S1. Although a metabolomics approach to objectively distinguishing Senkyu origin has been suggested previously,18 more than 80% of the chromatographic peaks have not been utilized for marker discovery because of the lack of compound information. Therefore, the same biological sets were used in this paper to demonstrate our strategy for discovering novel biomarkers.

Step 1. Data Reduction Using QC Curve Filtering. MS-DIAL automatically generated the metabolome table, which has 1974 features × 36 samples and 28 QCs (Supplementary Figure S1). However, chromatographic features are not always derived from intermediate metabolites of the living organism. Especially in GC−MS, background peaks derived from derivatization reagents, procedure-related artifacts, and column bleeding must be removed before statistical analysis. Some artifact EI−MS spectra can be registered in a spectral library so that features identified as artifact peaks are excluded from the metabolome table. Furthermore, an ad hoc threshold on peak intensity is often used to remove features with a low signal-to-noise ratio. Here, we introduce an experimental data-dependent method, QC curve filtering, which can objectively obtain biologically related metabolites from the initial metabolome table. The QC curve filtering hypothesis is that ion abundances of intermediate metabolites should be linearly related to the increase of concentration in diluted QC samples. An example of a QC curve result is shown in Figures 2a and 2b. Ion abundances of diluted QC samples in alignment ID 378 (defined by MS-DIAL) increased linearly along the dilution series, indicating that alignment ID 378 was derived from an intermediate metabolite of Senkyu (Figure 2a). In contrast, ion abundances of QC samples in alignment ID 497 were mostly similar among the diluted QCs, indicating that the chromatographic peaks were artifacts arising from a background substance (Figure 2b). Importantly, none of the chromatographic peaks obtained in this study were saturated, in either the GC column or the MS detector. Artifact peaks were excluded if aligned chromatographic peaks did not increase continually with the four concentration ranges or if the average relative standard deviation in each diluted QC stage was greater than 20 (also see Supporting Information). After exclusion, the number of features was 457, comprising 127 identified and 330 unknown peaks.

Step 2. Selection of Biologically Important Metabolites Using Principal Component Analysis.
Principal component analysis is the first choice in multivariate analysis in metabolomics for detecting outliers and gaining insights into metabolome phenotypes related to the biological phenotypes. In this case, the autoscaled data matrix was applied to PCA, and the first and second principal component axes (PC1 and PC2) were closely related to differences between cultivation countries (Japan and China), where the metabolome phenotype of C. officinale combined with the steaming procedure was substantially different from other origins, as described in PC2 (Figure 3a). In this study, we utilized a hypothesis testing method17 to objectively obtain biologically important metabolites related to the PC1 and PC2 axes. Briefly, loading values are defined as the correlation coefficient between the score and raw matrix values, and the statistical test is based on the hypothesis that the distribution of correlation coefficient values follows the t-distribution with (n − 2) degrees of freedom. We implemented this PCA methodology in our open-source Excel macro (available at http://prime.psc.riken.jp/Metabolomics_Software/StatisticalAnalysisOnMicrosoftExcel/index.html). The scatter plot in Figure 3b indicates that 245 chromatographic peaks (black), including 170 unknown metabolites, were recognized as significant metabolites contributing to the PC1 and PC2 axes.

Figure 2. Quality control calibration curve filter. (a) Example of chromatographic feature passing the QC curve filter. Ion abundances of alignment ID 378 were raised, along with the increase in QC concentrations, which was recognized as the intermediate metabolite. (b) Example of chromatographic feature not passing the QC curve filter, which was recognized as an artifact in GC−MS analysis because the ion abundances of alignment ID 497 were mostly equal among the QC diluted series.

Figure 3. Variable selection in principal component analysis. (a) Principal components 1 and 2 were interpreted as the axes, showing the differences between C. officinale and L. chuanxiong and the production areas (Japan or China), respectively. Abbreviations: CJSD, C. officinale, Japan, steaming and dry; CJD, C. officinale, Japan, dry only; CCSD, C. officinale, China, steaming and dry; CCD, C. officinale, China, dry only; LCSD, L. chuanxiong, China, steaming and dry; LCD, L. chuanxiong, China, dry only. (b) Loading plot with significant chromatographic features indicated by solid black circles. The p-values of hypothesis testing were corrected using Bonferroni’s method: black-filled circles stand for significant features.

Steps 3 and 4. Spectral Database-Oriented Compound Search with Retention Index Prediction. First, we addressed the curation of the MassBank and MoNA databases to define the structure data by InChIKey, to examine the reproducibility of RI values among different laboratories, and to evaluate the false discovery rate (FDR) using wider retention index tolerance for searching structures. After the InChIKey values were generated for all EI−MS records, we examined the metabolite subset relationships among laboratories contributing to public repositories: RIKEN (Prefix PR in MassBank), Kazusa (Prefix KZ in MassBank), Osaka University (Prefix OUF in MassBank), GL-science (Prefix GLS in MassBank), and UC Davis (BinBase in MoNA), which contained information on retention index (Figure 4a). Notably, the Fiehn RI values from the Fiehn BinBase records were converted to Kovats RI values using a simple regression model constructed by analyzing a mixture of fatty acid methyl esters and alkanes. It should also be noted that the first layer of InChIKey was utilized for structure definition, although some stereoisomers (e.g., glucose and galactose) can be distinguished by GC−MS. The Fiehn BinBase library was the top contributor among public resources (A in Figure 4a), containing a total of 259 unique compound records. In contrast, GL-Science (B) and Osaka University (D; our library) repositories contained 51 and 30 unique structures, respectively, and shared 96 unique structural records. The results in Figure 4a motivated us to use the RI and EI−MS spectral records of the other institutions in addition to our library (330 unique structures), covering a total of 829 unique structures. Therefore, RI reproducibility was examined for 70 unique structures shared by five research institutes (Figure 4b), of which 14 sugar compounds were excluded due to difficulties in structure definition. This evaluation is important for understanding reliable RI tolerance in spectra searching. As a result, we found that the standard deviations among publicly available experimental RIs were less than 50 (0.5 carbon chain length difference in Kovats index), except for the Kazusa DNA institute. In fact, the commercial NIST database containing experimental RI values for over 40 000 unique structures can also be utilized by using NIST MS search software, where the records from slightly polar columns such as DB-5 can be used, at least in our case.

Next, we addressed retention index prediction in order to utilize not only the remaining 7774 structure records in MassBank and MoNA but also the information in the Wiley commercial library (Figure 4c). First, the regression model for RI prediction was constructed from our library containing 430 records. All structural data in our library were converted to MeOX and TMS derivatives using MetaboloDerivatizer, where reactive C=O moieties were converted to MeOX (the E/Z isomer was manually determined), and all acidic protons except those of primary amines were converted to TMS. For a total of 131 metabolites of the 430 records that contain the primary amine (−NH2) moiety, the conversion was performed with three options, i.e., from −NH2 to −N(TMS)2, to −NH−TMS, or to −NH2 (see Experimental Section): a total of 612 derivative forms were created here. After substantial manual effort, the reliable pairs of RI and derivative form that we could define for 337 records were utilized for further analysis (Supporting Information Table S3). Importantly, a wider variety of MeOX or TMS derivative forms can be observed in GC−MS experiments: even for glucose, where 5TMS is the most stable form, the 4TMS form is sometimes observed. Curation of these derivative isomers will facilitate reliable identification in GC−MS metabolomics, and GC−MS databases such as the Fiehnlib12 and Golm DB13 can assist such curation. A total of 4566 descriptors were generated by PaDEL descriptor version 2.18.22 As a cross-validation approach, the RI and descriptor pairs were sorted by RI values, with odd- and even-ordered records used as set 1 (169 pairs) and set 2 (168 pairs), respectively. When set 1 was used as the training set, set 2 was utilized as the test set, and vice versa. The descriptors (ATSc1,23 topoDiameter,23 MLFER_L,24 and ETA_Beta25) were commonly selected as the top four important variables in both models, i.e., set 1 (training) → set 2 (test) and set 2 (training) → set 1 (test). The coefficients and intercept are shown in Figure 4c. The standard deviation (SD) and R-squared (R2) values were 78 and 0.93 for set 1 → set 2, and 88 and 0.93 for set 2 → set 1, respectively. Considering their chemical properties (see the Supporting Information), the retention index (RI) of TMS-derivative compounds can be predicted by reflections of substructural repeatability (ATSc1), gas−hexadecane interaction (MLFER_L), electronic state (ETA_Beta), and molecular size (topoDiameter), at least under our GC conditions (5% diphenyl and 95% dimethyl polysiloxane column with a linear gradient condition), where the standard deviation was less than 100 (one carbon chain length difference in the Kovats index) while the reusability of experimental RI values was shown as less than 50. Although the accuracy is not sufficient for “identification” of metabolites based on retention index only, this approach can be utilized as a “filter” for the exclusion of many false positive candidates, and the combination with EI−MS matching leads to the reliable annotation of unknown metabolites, as shown in the showcase of three metabolites. Moreover, this approach compensates for the lack of information on MeOX and TMS derivative forms in the NIST database, improving the accuracy of RI predictions for derivatized metabolites compared to the 300−400 Kovats unit errors (95% confidence interval) that are often generated by the NIST RI estimation program.14


Figure 4. Database curation for MassBank and MoNA. (a) Venn diagram showing overlaps among published structures containing retention index (RI) information, created using the first layer of InChIKey. “Others” denotes MassBank or MoNA records not containing RI information. (b) Reproducibility of RI values, examined by overlapped (ABCDE) metabolites. The y-axis shows the RI difference between Osaka University (D) and the others. The standard deviation (SD) of the RI differences was also calculated. (c) Total of 337 RI and descriptor pairs was divided in two (set 1 and set 2), with one used as the training set and the other as the test set: set 1 → set 2 means that sets 1 and 2 were used as training and test sets, respectively. Mathematical equation for RI prediction is shown at the bottom, with R-squared and SD values also described. (d) False discovery rate (FDR) evaluated from 127 identified spectra, in combination with publicly available EI−MS records. The x- and y-axes represent RI tolerance and FDR value, respectively. Black and red lines are identification results for full layer and first layer matching of InChIKey, respectively.
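The two-fold validation scheme summarized in Figure 4c (records sorted by RI, odd-ordered records as set 1 and even-ordered records as set 2, each set used once for training and once for testing) can be sketched in Python as below. For brevity the sketch fits an ordinary least-squares line on a single descriptor instead of the four-descriptor multiple regression built in R, and it assumes that SD means the standard deviation of the prediction errors and that R-squared is 1 − SS_res/SS_tot; both readings, and all names, are illustrative.

```python
from statistics import mean, stdev

def split_odd_even(pairs):
    """pairs: list of (ri, descriptor). Sort by RI, then alternate records."""
    ordered = sorted(pairs, key=lambda p: p[0])
    return ordered[0::2], ordered[1::2]  # set 1 (odd order), set 2 (even order)

def fit_line(train):
    """Single-descriptor least squares; a stand-in for the full model."""
    xs = [x for _, x in train]
    ys = [ri for ri, _ in train]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def evaluate(predict, test):
    """Return (SD of prediction errors, R-squared) on the test set."""
    errors = [ri - predict(x) for ri, x in test]
    observed_mean = mean(ri for ri, _ in test)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((ri - observed_mean) ** 2 for ri, _ in test)
    return stdev(errors), 1.0 - ss_res / ss_tot

def two_fold(pairs):
    set1, set2 = split_odd_even(pairs)
    return {"set 1 -> set 2": evaluate(fit_line(set1), set2),
            "set 2 -> set 1": evaluate(fit_line(set2), set1)}
```

With 337 records this split yields 169 and 168 pairs, matching the set sizes reported in the text. Alternating records along the sorted RI axis gives both halves a similar RI range.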

We applied this regression model to 11 111 MassBank- and 668 231 Wiley records not containing RI information. The derivatization was simply performed by the optimal parameter of MetaboliteDetector where the primary amine is converted as 1TMS (−NH−TMS): a total of 679 342 derivative forms was exported, and then their retention indexes were predicted via the regression model. Evaluation of False Discovery Rate. When using the expanded spectral library, the RI tolerance for searching structures should be spread for public experimental RIs (±50 RI) and for predicted RIs (±100 RI), in contrast to the validated in-house RIs (less than 10 RI). Therefore, it was important to estimate the false discovery rate (FDR) in searching spectral databases, along with the wider RI tolerance setting. We utilized a total of 127 spectra identified in Chinese medicine Senkyu for the validation kit to evaluate FDRs. The RI and EI spectra of these compounds were manually validated by a mass spectrometry specialist with our in-house library. Furthermore, a total of 13 570 spectra containing all MassBank and MoNA records were used as the search space (Figure 4d). As a result, the FDR values when searching other experimental RIs (tolerance 50) and predicted RIs (tolerance 100) were estimated as 16.5% and 37.0%, respectively, to distinguish the stereoisomers. In contrast, these FDR values were decreased for

the determination of molecular skeleton (atom and bond connectivity described by the first layer of InChIKey) to less than 9.4% and 14.1% at RI tolerances of 50 and 100, respectively. This result suggested that analyzing authentic standards was essential, but RI filtering improved the efficiency of obtaining the correct structure compared with results obtained without the RI criteria (19%). Furthermore, the computational cost (retrieval frequency of spectral records) was reduced: the total count of spectral search trials for 127 queries against 13 570 spectra using RI tolerances of 50, 100, and infinity was approximately 100 000, 200 000, and 1 700 000, respectively.

Figure 5. Identification of three metabolites via spectral searching and RI filtering. EI−MS spectra of three biologically significant chromatographic peaks were subjected to candidate searching in MS-FINDER software with a spectral similarity cutoff of 80%. The structure candidates were further refined by RI filtering, which reduced the number of structure candidates for IDs 301, 761, and 1538 from 46, 5, and 64 (spectral searching) to 4, 2, and 5 (RI filtering), respectively. Identification was finally performed using commercially available authentic standard compounds.

Identification of Three Metabolites. Our four-step strategy was showcased in the identification of three new metabolites not stored in our own library and not reported for Chinese medicine Senkyu (Figure 5). It should be noted that this strategy is not suitable for the identification of metabolites that are truly new compounds (unknown−unknowns), but for the practical annotation of unknown EI−MS spectra using spectral databases. We focused on three unknown EI−MS spectra, assigned IDs 301, 761, and 1538 in MS-DIAL. These unknowns were among the 170 unknown peaks obtained from PCA hypothesis testing. A total of 682 248 EI−MS spectra, including all private, public, and commercial records, were imported into the MS-FINDER spectral search engine. Note that the library contains EI−MS spectra not only from MeOX-TMS derivative forms but also from other derivative types and from underivatized structures: in this study, we used all records because investigating the background information for over 680 000 records would be a tremendous task. The cutoff value for spectral similarity was set to 80%. Spectral searching gave a total of 46, 5, and 64 structure candidates for IDs 301, 761, and 1538, respectively. The RI values for EI−MS records with no RI information were predicted as described in steps 3 and 4 above. The cutoff value for RI filtering was set to 50 (0.5 carbon chain length difference in the Kovats index), where the FDR value was estimated as 9.4% according to the previous section. With this RI filter in addition to spectral matching, the number of structure candidates became 4, 2, and 5 for IDs 301, 761, and 1538, respectively. Commercially available standard compounds were purchased, and these candidates were finally identified as butane-1,2,3-triol, 3-deoxyglucosone, and palatinitol, respectively (Figure 5). Interestingly, palatinitol, which is known as a growth factor of Bifidobacterium in the human microbiome,26 was highly accumulated in Cnidium officinale Makino that had undergone a steaming process, which is recognized as the general product of Chinese medicine Senkyu in Japan. As the Chinese medicine Senkyu is used to improve gastrointestinal function, this result indicated that palatinitol was the bioactive compound for improving the human microbiome by promoting Bifidobacterium growth. In contrast, butane-1,2,3-triol can be used as a quality marker to distinguish Senkyu imitations on the Japanese market. Importantly, these compounds were efficiently identified at low cost and with little time and effort by the four-step strategy.
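The combination of the 80% spectral similarity cutoff and the RI tolerance of 50 can be illustrated with a minimal Python sketch. It is not the MS-FINDER implementation: the scoring function used by MS-FINDER is not specified in this section, so plain cosine similarity on nominal-mass spectra is assumed here as a stand-in, and the names `Record`, `cosine_similarity`, and `filter_candidates`, as well as the toy spectra, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    spectrum: dict  # {nominal m/z: intensity}
    ri: float       # measured or predicted retention index


def cosine_similarity(a, b):
    """Cosine similarity between two spectra stored as {m/z: intensity}.

    Assumption: used only as a stand-in for the MS-FINDER similarity score.
    """
    dot = sum(a[mz] * b[mz] for mz in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_candidates(query_spectrum, query_ri, library,
                      min_similarity=0.80, ri_tolerance=50.0):
    """Return (name, score) pairs that pass both the RI and similarity cutoffs."""
    hits = []
    for rec in library:
        # RI check first: records outside the tolerance are skipped
        # without computing a spectral similarity at all.
        if abs(rec.ri - query_ri) > ri_tolerance:
            continue
        score = cosine_similarity(query_spectrum, rec.spectrum)
        if score >= min_similarity:
            hits.append((rec.name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data, invented for illustration only.
query = {73: 100.0, 147: 40.0, 205: 60.0}
library = [
    Record("candidate_A", {73: 100.0, 147: 35.0, 205: 65.0}, ri=1520.0),
    Record("candidate_B", {73: 100.0, 147: 35.0, 205: 65.0}, ri=1700.0),
    Record("candidate_C", {73: 100.0, 117: 80.0, 129: 50.0}, ri=1510.0),
]
hits = filter_candidates(query, query_ri=1500.0, library=library)
```

In this toy case only `candidate_A` survives: `candidate_B` matches spectrally but lies 200 RI units away, and `candidate_C` is within the RI window but falls below the similarity cutoff. Applying the RI check before the similarity calculation gives the same candidates as filtering afterwards, but avoids unnecessary comparisons; with an infinite tolerance every query is compared with every record, which for 127 queries and 13 570 spectra is 127 × 13 570 = 1 723 390 comparisons, in line with the approximately 1 700 000 trials reported above.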

In fact, we also tried to annotate an additional 10 chromatographic peaks among the 170 biologically important unknown peaks, for which the top candidate suggested by MS-FINDER had greater than 90% spectral similarity to the reference spectrum. Of these, four peaks (ID 76, 297, 1216, and 1527), which were annotated as sugar-related structures, were examined using authentic standard compounds. Unfortunately, the retention indexes of the standards differed slightly from those of the unknown peaks. Nevertheless, these results suggested that our strategy efficiently led us to annotate novel metabolites because the "challenge" of authentic standard experiments succeeded with 43% accuracy (three of seven).
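The reported accuracy follows from the counts given above, assuming that the seven standard experiments are the three confirmed peaks (ID 301, 761, and 1538) plus the four peaks whose standards did not match (ID 76, 297, 1216, and 1527):

```latex
\[
\text{accuracy} \;=\; \frac{\text{confirmed}}{\text{confirmed} + \text{not confirmed}}
\;=\; \frac{3}{3 + 4} \;=\; \frac{3}{7} \;\approx\; 0.43
\]
```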



CONCLUSION

We have demonstrated an efficient four-step strategy using EI−MS databases for metabolite annotation of unknown EI−MS spectra. QC curve filtering and hypothesis testing in principal component analysis efficiently extracted reliable chromatographic peaks that could be considered as intermediate and biologically important metabolites. The MS-FINDER spectral search engine contributed to identifying structure candidates using spectral similarities. Finally, the prediction of retention indexes considerably improved the narrowing down of structure candidates. This strategy can serve as a useful guide for the identification of unknown EI−MS spectra by using publicly or commercially available spectral records containing over 900 000 mass spectra.




ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.7b01010.

(1) Legends of supplementary figures and tables; (2) details of samples and experimental procedures; (3) details of noise reduction using the calibration curve of dilution series of QC samples; (4) discussion of chemical properties used in retention index predictions (PDF)
Figure S1 (PDF)
Tables S1−S3 (XLSX)

AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected]. Tel.: +81-45-503-9491.
*E-mail: [email protected]. Tel.: +81-6-68797424.

ORCID

Hiroshi Tsugawa: 0000-0002-2015-3958

Author Contributions

#T.M. and H.T. contributed equally to this research.

T.M., H.T., and E.F. designed the study. H.T. wrote the source code for the MS-FINDER spectral search engine, MetaboloDerivatizer, and the statistical analysis software. T.M. and H.T. manually confirmed the identification result of MS-DIAL. T.M. performed the experiments. T.M. and H.M. checked the derivative form of original structures included in our in-house database. H.T. performed the curation of metabolome table, hypothesis testing, and FDR evaluation. T.M., H.T., and E.F. thoroughly discussed the project. T.M. and H.T. wrote the manuscript, and other authors contributed to manuscript editing.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This work was supported by the AMED-Core Research for Evolutionary Science and Technology (AMED-CREST). H.T. was supported by a grant-in-aid for scientific research (C) 15K01812. The study represents a portion of the dissertation submitted by Teruko Matsuo to Osaka University in partial fulfillment of the requirement for her Ph.D.

REFERENCES

(1) Cajka, T.; Fiehn, O. Anal. Chem. 2016, 88, 524−545.
(2) Tsugawa, H.; Cajka, T.; Kind, T.; Ma, Y.; Higgins, B.; Ikeda, K.; Kanazawa, M.; VanderGheynst, J.; Fiehn, O.; Arita, M. Nat. Methods 2015, 12, 523−526.
(3) Johnson, C. H.; Ivanisevic, J.; Siuzdak, G. Nat. Rev. Mol. Cell Biol. 2016, 17, 451−459.
(4) Tsugawa, H.; Tsujimoto, Y.; Arita, M.; Bamba, T.; Fukusaki, E. BMC Bioinf. 2011, 12, 131.
(5) Lai, Z.; Fiehn, O. Mass Spectrom. Rev. 2016, 9999, 1−13.
(6) Horai, H.; Arita, M.; Kanaya, S.; Nihei, Y.; Ikeda, T.; Suwa, K.; Ojima, Y.; Tanaka, K.; Tanaka, S.; Aoshima, K.; Oda, Y.; Kakazu, Y.; Kusano, M.; Tohge, T.; Matsuda, F.; Sawada, Y.; Hirai, M. Y.; Nakanishi, H.; Ikeda, K.; Akimoto, N.; Maoka, T.; Takahashi, H.; Ara, T.; Sakurai, N.; Suzuki, H.; Shibata, D.; Neumann, S.; Iida, T.; Tanaka, K.; Funatsu, K.; Matsuura, F.; Soga, T.; Taguchi, R.; Saito, K.; Nishioka, T. J. Mass Spectrom. 2010, 45, 703−714.
(7) Stein, S. Anal. Chem. 2012, 84, 7274−7282.
(8) Allen, F.; Pon, A.; Greiner, R.; Wishart, D. S. Anal. Chem. 2016, 88, 7689−7697.
(9) Kovats, E. Helv. Chim. Acta 1958, 41, 1915−1932.
(10) Lee, M. L.; Vassilaros, D. L.; White, C. M. Anal. Chem. 1979, 51, 768−773.
(11) Kind, T.; Wohlgemuth, G.; Lee, D. Y.; Lu, Y.; Palazoglu, M.; Shahbaz, S.; Fiehn, O. Anal. Chem. 2009, 81, 10038−10048.
(12) Fiehn, O.; Wohlgemuth, G.; Scholz, M. Lect. Notes Comput. Sci. 2005, 3615, 224.
(13) Kopka, J.; Schauer, N.; Krueger, S.; Birkemeyer, C.; Usadel, B.; Bergmüller, E.; Dörmann, P.; Weckwerth, W.; Gibon, Y.; Stitt, M.; Willmitzer, L.; Fernie, A. R.; Steinhauser, D. Bioinformatics 2005, 21, 1635−1638.
(14) Stein, S. E.; Babushok, V. I.; Brown, R. L.; Linstrom, P. J. J. Chem. Inf. Model. 2007, 47, 975−980.
(15) Kumari, S.; Stevens, D.; Kind, T.; Denkert, C.; Fiehn, O. Anal. Chem. 2011, 83, 5895−5902.
(16) Want, E. J.; Masson, P.; Michopoulos, F.; Wilson, I. D.; Theodoridis, G.; Plumb, R. S.; Shockcor, J.; Loftus, N.; Holmes, E.; Nicholson, J. K. Nat. Protoc. 2013, 8, 17−32.
(17) Yamamoto, H.; Fujimori, T.; Sato, H.; Ishikawa, G.; Kami, K.; Ohashi, Y. BMC Bioinf. 2014, 15, 51.
(18) Kobayashi, S.; Nagasawa, S.; Yamamoto, Y.; Donghyo, K.; Bamba, T.; Fukusaki, E. J. Biosci. Bioeng. 2012, 114, 86−91.
(19) Tsugawa, H.; Bamba, T.; Shinohara, M.; Nishiumi, S.; Yoshida, M.; Fukusaki, E. J. Biosci. Bioeng. 2011, 112, 292−298.
(20) Wohlgemuth, G.; Haldiya, P. K.; Willighagen, E.; Kind, T.; Fiehn, O. Bioinformatics 2010, 26, 2647−2648.
(21) Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. J. Chem. Inf. Comput. Sci. 2003, 43, 493−500.
(22) Yap, C. W. J. Comput. Chem. 2011, 32, 1466−1474.
(23) Todeschini, R.; Consonni, V. Molecular descriptors for chemoinformatics, 2nd ed.; Wiley: New York, 2009.
(24) Platts, J. A.; Butina, D.; Abraham, M. H.; Hersey, A. J. Chem. Inf. Model. 1999, 39, 835−845.
(25) Roy, K.; Ghosh, G. Internet Electron. J. Mol. Des. 2003, 2, 599−620.
(26) van Weerden, E. J.; Huisman, J. Br. J. Nutr. 1993, 69, 455−466.