
Article

Integrated strategy for unknown EI–MS identification using quality control calibration curve, multivariate analysis, EI–MS spectral database, and retention index prediction Teruko Matsuo, Hiroshi Tsugawa, Hiromi Miyagawa, and Eiichiro Fukusaki Anal. Chem., Just Accepted Manuscript • Publication Date (Web): 18 May 2017 Downloaded from http://pubs.acs.org on May 21, 2017


Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Analytical Chemistry

Integrated strategy for unknown EI–MS identification using quality control calibration curve, multivariate analysis, EI–MS spectral database, and retention index prediction

Author names
Teruko Matsuo1,#, Hiroshi Tsugawa1,2,3,#,*, Hiromi Miyagawa4, Eiichiro Fukusaki1,*

Author affiliations
1 Department of Biotechnology, Graduate School of Engineering, Osaka University, Suita, Osaka 565-0871, Japan
2 RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
3 RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
4 GL Sciences Inc., Iruma, Saitama 358-0032, Japan

#Teruko Matsuo and Hiroshi Tsugawa contributed equally to this research.
*Co-corresponding authors

Corresponding authors
Hiroshi Tsugawa: [email protected]
RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
+81-45-503-9491

Eiichiro Fukusaki: [email protected]
Department of Biotechnology, Graduate School of Engineering, Osaka University, 2–1 Yamadaoka, Suita, Osaka 565-0871, Japan
+81-6-6879-7424

Keywords: metabolomics, structure elucidation, spectral database, retention index prediction


Abstract
Compound identification using unknown electron ionization (EI) mass spectra in gas chromatography coupled with mass spectrometry (GC–MS) is challenging in untargeted metabolomics, natural product chemistry, and exposome research. While the total count of EI–MS records included in publicly or commercially available databases is over 900,000, efficient use of this huge resource has not been achieved in metabolomics. Therefore, we propose a ‘four-step’ strategy for the identification of biologically significant metabolites using an integrated cheminformatics approach: (i) a quality control calibration curve to reduce background noise, (ii) variable selection by hypothesis testing in principal component analysis for the efficient selection of target peaks, (iii) searching the EI–MS spectral database, and (iv) retention index (RI) filtering in combination with RI predictions. In this study, the new MS-FINDER spectral search engine was developed and used for searching EI–MS databases by mass spectral similarity, with evaluation of the false discovery rate. Moreover, in silico derivatization software, MetaboloDerivatizer, was developed to calculate the chemical properties of derivative compounds, and retention indices for all records in the EI–MS databases were predicted using a simple mathematical model. The strategy was showcased in the identification of three novel metabolites (butane-1,2,3-triol, 3-deoxyglucosone, and palatinitol) in the Chinese medicine Senkyu for quality assessment, as validated using authentic standard compounds. All tools and curated public EI–MS databases are freely available in the ‘Computational MS-based metabolomics’ section of the RIKEN PRIMe website (http://prime.psc.riken.jp).


Introduction
Mass spectrometry (MS)-oriented untargeted metabolomics simultaneously provides ion abundance information for several hundred metabolites in biological samples, including both identified and unidentified metabolites. The result, known as a ‘metabolome table’, is used for further statistical analyses in various medical or biological sciences.1–3 Gas chromatography coupled with electron ionization (EI) mass spectrometry (GC–MS) is a popular technique for untargeted metabolomics.4–5 The workflow of GC–MS based metabolomics allows 500–1000 chromatographic peaks to be obtained in a single run with highly stable retention index (RI) and EI–MS spectral information. However, of these, only 100–200 small biomolecules, representing about 20% of the unique chromatographic peaks, can be identified by similarity matching with the EI spectra and retention indices (RIs) of reference compounds. There is no perfect solution for identifying all chromatographic peaks, although studies tackling this issue have been reported, as described below. Therefore, novel and more efficient strategies for the structural elucidation of unknown EI–MS spectra are required in metabolomics. A basic approach for annotating unknown metabolites uses the large EI spectral databases, including MassBank,6 and the commercially available NIST and Wiley databases. Currently, there are 15,302 experimental spectra records of 9,003 unique structures in the MassBank of North America (MoNA), 276,259 experimental spectra records of 242,477 unique structures in NIST 14, and 719,456 experimental spectra records of 583,059 unique structures in the Wiley 10th spectra database.
So far, small molecules have been successfully identified in many studies by combining NIST (plus Wiley and MassBank) with AMDIS software, which is frequently used for GC–MS spectral deconvolution followed by spectral searching with a certain RI tolerance.7 Additionally, the prediction of EI spectra has recently been reported to expand the coverage of these databases.8 Databases and related programs can assist greatly in annotating unknown EI spectra. However, searching spectral databases or using in silico spectral prediction often provides many false positive candidates, and analyzing authentic standards for all candidates is unrealistic. The great advantage of GC–MS compared to LC–MS is the reusability of robust and accurate retention markers, i.e. RIs, which have been stored for decades using well-known systems including the Kovats9, Lee10, and Fiehn11 indices, where conversion between them is also available as described in Stephen et al.7 or in the MoNA database. These experimental RI values have been used effectively for the reliable identification of biomolecules, enabling the construction of highly scalable GC–MS metabolome databases such as BinBase12 and Golm DB13. However, RI information is not always available, and therefore a methodology for RI prediction is required for the comprehensive use of experimental records, as several groups have reported.14,15

Furthermore, narrowing down the unknown candidates in an analytical and statistical manner is also important before any further annotation process. For example, quality control (QC) samples, a mixture of small aliquots from each biological sample in which all molecules present in any sample can be detected, have been used to filter out chromatographic noise.16 Additionally, a methodology for variable selection in multivariate analysis has been developed to objectively obtain biologically important metabolites using hypothesis testing.17 Therefore, an integrated cheminformatics approach using QC samples, statistical analysis, spectral searches, and retention index prediction needs to be developed for efficient use of EI–MS databases to identify biologically significant metabolites.

In this study, we propose a four-step strategy for compound identification of unknown GC–MS peaks (Figure 1). Firstly, noise chromatographic peaks are removed using a calibration curve from a dilution series of a pooled QC sample (QC curve filtering), which gives a curated metabolome table. The basic concept of this QC curve filtering is that the ion abundance of biologically meaningful chromatographic peaks should change in proportion to the dilution series of a pooled QC sample, while background noise peaks from derivatization reagents, column bleed, and other reactants should be unrelated to the QC curve. Secondly, variable selection in principal component analysis (PCA) is objectively performed by hypothesis testing.
Thirdly, the EI spectra database is searched for unknown candidates using the newly developed MS-FINDER spectral search engine, in which the publicly available GC–MS records are implemented (15,302 records at MS-FINDER version 2.12), while the Wiley 10th database was also imported independently as a ‘user-defined library’ in this study. Finally, we provide a retention index prediction approach using a quantitative structure-retention relationship (QSRR) method in combination with the newly developed in silico derivatization software MetaboloDerivatizer to filter out the remaining false positive candidates. Herein, we demonstrate this strategy using 64 GC–MS data files for the Chinese medicine Senkyu, the dried root of Cnidium officinale Makino, which is used in Japan as a herbal medicine for menopause, poor circulation, drainage, and skin disease. Our approach discovered three new biomolecules in Senkyu, which will give novel insights into its medicinal effects and can serve as quality markers for judging its biological origin.


Experimental section

Samples and experimental procedures
Six types of the Chinese medicine Senkyu were supplied by Tochimoto Tenkaido Co. Ltd. (Osaka, Japan), and their origins were defined by species (Cnidium officinale Makino or Ligusticum chuanxiong Hort.), cultivation area (Japan or China), and manufacturing process (with or without steaming), as summarized in Supplementary Table S1. A 10 mg sample of each dried root was weighed and homogenized in a 2-mL Eppendorf tube. For quality control samples, 10 mg of each sample was added into a 15-mL tube (total 360 mg; six biological samples and six biological replicates) and then divided into six 2-mL Eppendorf tubes such that they contained 10 mg, 20 mg, 30 mg, 40 mg, 50 mg, and 60 mg, respectively. An empty tube, labeled as 0 mg QC, was also prepared for monitoring background effects. Metabolite extraction, derivatization, and GC–MS analysis followed reported methods (Supporting Information).18,19

Data processing
NetCDF format files exported from GCMSsolutions (Shimadzu Co., Kyoto, Japan) were converted to Analysis Base Framework (ABF) format files using a free ABF file converter (http://www.reifycs.com/AbfConverter/index.html). MS-DIAL (version 2.48) software was downloaded from the RIKEN PRIMe website and used for data processing of the GC–MS dataset. The parameters were set as follows: smoothing level, 3; minimum peak height, 2000; average peak width, 20; default values were used for all other parameters. The MSP format file (EI–MS reference library) was created from our in-house database, and can also be downloaded from the RIKEN PRIMe website (entitled Osaka Univ. DB). Note that the ranking of structure candidates was based on mass spectral similarity, which was the total score of the dot product, the reverse dot product, and the presence percentage of fragment ions (weighted 2:2:1, respectively), in combination with RI similarity. Details of the mathematical functions followed our previous report.2 After automatic data processing was finished, the identification results were manually curated in the MS-DIAL graphical user interface by a GC–MS expert, and false positive identification results were changed to ‘unknown’. A total of 1,975 chromatographic peaks (labeled as aligned spots) were created, comprising 127 identified and 1,848 unknown peaks (Supplementary Table S2). All GC–MS data files can be downloaded from the RIKEN Dropmet website (http://prime.psc.riken.jp/?action=drop_index).
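The 2:2:1 weighting of the three spectral terms can be illustrated with a minimal Python sketch. It assumes nominal-mass spectra stored as dictionaries of m/z to intensity, a plain cosine for the dot product, and a reverse dot product restricted to the reference peaks. These definitions, and the function names `cosine` and `total_score`, are illustrative assumptions; the exact functions used by MS-DIAL are those of the previous report2 and may differ (for example in intensity weighting).

```python
import math


def cosine(a, b, keys):
    """Cosine similarity of spectra a and b restricted to the m/z values in keys."""
    num = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(a.get(k, 0.0) ** 2 for k in keys))
    norm_b = math.sqrt(sum(b.get(k, 0.0) ** 2 for k in keys))
    den = norm_a * norm_b
    return num / den if den else 0.0


def total_score(query, ref):
    """Weighted 2:2:1 combination of dot product, reverse dot product and presence."""
    dot = cosine(query, ref, set(query) | set(ref))   # all peaks of both spectra
    reverse = cosine(query, ref, set(ref))            # reference peaks only (assumption)
    presence = sum(1 for mz in ref if mz in query) / len(ref)
    return (2.0 * dot + 2.0 * reverse + presence) / 5.0


# Hypothetical spectra of a TMS derivative (m/z -> intensity).
query_spectrum = {73: 100.0, 147: 50.0, 217: 20.0}
reference_spectrum = {73: 100.0, 147: 45.0, 217: 25.0, 319: 5.0}
score = total_score(query_spectrum, reference_spectrum)
```

The score lies between 0 and 1, so a cut-off such as 80% corresponds to `score >= 0.8` under these assumed definitions.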


MS-FINDER spectral search engine for using mass spectral databases
The spectral search engine was implemented in MS-FINDER software. EI–MS spectral records from MassBank and MoNA were implemented internally as the default database for the search engine. The Wiley 10th mass spectral records were prepared in MSP format and were also imported into MS-FINDER as a ‘user-defined MSP database’. Note that we excluded the Wiley spectral records for which the SMILES code was not recognized by ChemAxon molconverter (https://www.chemaxon.com/): a total of 668,231 spectral records were used in this study. The NIST14 EI–MS spectra were not used because they could not be converted to ASCII format by the Lib2NIST converter for use in MS-FINDER. SMILES and InChIKey for all internal records were generated from the InChI code, CAS number, or (sometimes) chemical name, using ChemAxon molconverter for InChI codes and the Chemical Translation Service20 for CAS numbers and chemical names. Then, all molecular structures were computationally derivatized as described in the next section for RI prediction. Structures were ranked by spectral similarity using the same method as MS-DIAL (see above). Note that only experimental EI–MS spectra were used in this study, and the publicly distributable spectral records, i.e. MassBank and MoNA, are available at the RIKEN PRIMe website.

In silico derivatization for MeOX and TMS reaction of small molecules
The in silico derivatization software MetaboloDerivatizer, developed in the C# language, converts target biomolecules into their methoxy (MeOX) and trimethylsilylated (TMS) forms. The Chemistry Development Kit21 was used to parse the SMILES code of a target molecule into a list of atom and bond connectivity, and to export the reacted SMILES code from that list.
For the MeOX reaction, the carbonyl groups (a carbon–oxygen double bond) in ketone or aldehyde moieties were recognized as reactive C=O, and converted to C=N–OCH3. For the TMS reaction, all acidic protons in OH, COOH, NRH, and SH moieties were converted to O–TMS, COO–TMS, NR–TMS, and S–TMS, respectively. In contrast, the reaction of primary amines was optional: users can select the degree of conversion as –N(TMS)2, –NH–TMS, or –NH2, with the single-TMS form, i.e. –NH–TMS, used as the default setting. The derivative form can be exported as a SMILES code, and batch conversion is also available.

Retention index prediction
Our in-house spectral library was used to construct the retention index (RI) regression model. The derivatized forms of the compound structures were automatically created by MetaboloDerivatizer. For metabolites containing primary amines, all possible TMS forms, as described above, were prepared. The assignment of derivative isomers arising from the reaction of primary amines, such as Valine 1TMS, 2TMS, and 3TMS, and from the cis/trans configuration of the MeOX reaction, was performed manually by GC–MS experts using literature and database mining and substantial manual analysis. While our library contained 430 spectral records, we used the 337 records for which the derivative forms could be assigned with confidence (also see Result and Discussion). Molecular descriptors of the derivative SMILES codes were computed by PaDEL-Descriptor version 2.18 (ref 22). Furthermore, the correlation coefficient between the RI value and each descriptor value was calculated, and descriptors with correlation coefficient values smaller than 0.8 were excluded (Supplementary Table S3). A multiple regression model was constructed in the statistical language R version 3.2.2, and the ‘forward-step’ function, which applies the Akaike Information Criterion (AIC) in combination with a cross validation method for model selection, was used to determine important variables (see Result and Discussion).
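The descriptor pre-filter described above can be sketched as follows. This is a Python illustration rather than the R workflow used in the study, and the descriptor names and values are hypothetical toy data. It applies the 0.8 threshold to the signed Pearson coefficient as literally stated; whether the absolute value was used is not specified in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column carries no information
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(ri_values, descriptors, threshold=0.8):
    """Keep descriptors whose correlation with RI is at least `threshold`."""
    kept = {}
    for name, values in descriptors.items():
        r = pearson(ri_values, values)
        if r >= threshold:
            kept[name] = r
    return kept


# Hypothetical toy data: five library records and two made-up descriptors.
ri_values = [1000.0, 1200.0, 1400.0, 1600.0, 1800.0]
descriptors = {
    "size_like": [10.0, 12.0, 14.5, 16.0, 18.0],   # tracks RI closely
    "noise_like": [3.0, 1.0, 4.0, 1.0, 5.0],       # weakly related to RI
}
kept = filter_descriptors(ri_values, descriptors)
```

Only the descriptors that survive this filter would then be passed to a stepwise model selection procedure.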


Result and Discussion
The four-step strategy reported herein used 64 GC–MS data files of the Chinese medicine Senkyu (six different origins, n = 6; seven concentration ranges of QC samples, n = 4), as described in Supplementary Table S1. Although a metabolomics approach to objectively distinguishing Senkyu origin has been suggested previously,18 more than 80% of the chromatographic peaks have not been used for marker discovery because of the lack of compound information. Therefore, the same biological sets were used in this paper to demonstrate our strategy for discovering novel biomarkers.

Step 1. Data reduction using QC curve filtering
MS-DIAL automatically generated the metabolome table, which has 1,974 features × (36 samples and 28 QCs) (Supplementary Figure S1). However, chromatographic features are not always derived from intermediate metabolites of the living organism. Especially in GC–MS, background peaks derived from derivatization reagents, procedure-related artifacts, and column bleed must be removed before statistical analysis. Some artifact EI–MS spectra can be registered in the spectral library so that features identified as artifact peaks are excluded from the metabolome table. Furthermore, an ad hoc threshold on peak intensity has been used to remove features with low signal-to-noise ratios. Here, we introduce an experimental data dependent method, QC curve filtering, which can objectively obtain biologically related metabolites from the initial metabolome table. The QC curve filtering hypothesis is that ion abundances of intermediate metabolites should be linearly related to the increase of concentration in diluted QC samples. An example of a QC curve result is shown in Figures 2a and 2b. Ion abundances of diluted QC samples in alignment ID 378 (defined by MS-DIAL) increased linearly along the dilution series, indicating that alignment ID 378 was derived from an intermediate metabolite of Senkyu (Figure 2a).
In contrast, ion abundances of QC samples in alignment ID 497 were mostly similar among the diluted QCs, indicating that the chromatographic peaks were artifacts arising from a background substance (Figure 2b). Importantly, none of the chromatographic peaks obtained in this study were saturated, either in the GC column or in the MS detector. Artifact peaks were excluded if the aligned spot did not increase continually across the four concentration ranges, or if the average relative standard deviation in each diluted QC stage was greater than 20% (also see Supporting Information). After exclusion, the number of features was 457, comprising 127 identified and 330 unknown peaks.
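The two exclusion rules of QC curve filtering can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the replicate means must increase strictly from one QC level to the next, and the relative standard deviation (RSD) is averaged over the levels. The function name `passes_qc_curve` and the intensity values are hypothetical; the exact criteria are given in the Supporting Information.

```python
from statistics import mean, stdev


def passes_qc_curve(levels, max_rsd=20.0):
    """Return True if a feature behaves like a real metabolite in the QC dilution series.

    levels: replicate intensities per QC level, ordered by increasing sample amount.
    """
    means = [mean(replicates) for replicates in levels]
    # Rule 1: abundance must increase continually along the dilution series.
    increasing = all(later > earlier for earlier, later in zip(means, means[1:]))
    if not increasing:
        return False
    # Rule 2: average RSD (%) over the QC levels must not exceed max_rsd.
    rsds = [100.0 * stdev(r) / mean(r) for r in levels if mean(r) > 0]
    return bool(rsds) and mean(rsds) <= max_rsd


# Hypothetical intensities for four QC levels with n = 4 replicates each.
metabolite_like = [[100, 110, 105, 95], [210, 200, 190, 205],
                   [300, 310, 290, 305], [400, 390, 410, 405]]
background_like = [[500, 510, 495, 505], [498, 502, 507, 493],
                   [503, 497, 500, 501], [499, 504, 496, 502]]

keep_metabolite = passes_qc_curve(metabolite_like)
keep_background = passes_qc_curve(background_like)
```

A flat profile such as `background_like` fails the first rule, which mirrors the behavior described for alignment ID 497.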


Step 2. Selection of biologically important metabolites using principal component analysis
Principal component analysis is the first choice in multivariate analysis in metabolomics for detecting outliers and gaining insights into metabolome phenotypes related to the biological phenotypes. In this case, PCA was applied to the autoscaled data matrix, and the first and second principal component axes (PC1 and PC2) were closely related to differences between cultivation countries (Japan and China), where the metabolome phenotype of C. officinale combined with the steaming procedure was substantially different from the other origins, as shown along PC2 (Figure 3a). In this study, we used a hypothesis testing method17 to objectively obtain biologically important metabolites related to the PC1 and PC2 axes. Briefly, loading values are defined as the correlation coefficient between the score and raw matrix values, and the statistical test is based on the hypothesis that the correlation coefficient follows the t-distribution with (n–2) degrees of freedom; for a correlation coefficient r computed over n samples, the standard test statistic is t = r·sqrt(n–2)/sqrt(1–r²). We implemented this PCA methodology in our open-source Excel macro (available at http://prime.psc.riken.jp/Metabolomics_Software/StatisticalAnalysisOnMicrosoftExcel/index.html). The scatter plot in Figure 3b indicates that 245 chromatographic peaks (black), including 170 unknown metabolites, were recognized as significant metabolites contributing to the PC1 and PC2 axes.

Steps 3 and 4. Spectral database-oriented compound search with retention index prediction
Firstly, we curated the MassBank and MoNA databases in order to define the structure data by InChIKey, to examine the reproducibility of RI values among different laboratories, and to evaluate the false discovery rate (FDR) when a wider retention index tolerance is used for searching structures.
After the InChIKey values were generated for all EI–MS records, we examined the metabolite subset relationships among the laboratories contributing to public repositories: RIKEN (Prefix PR in MassBank), Kazusa (Prefix KZ in MassBank), Osaka University (Prefix OUF in MassBank), GL Sciences (Prefix GLS in MassBank), and UC Davis (BinBase in MoNA), all of which provide retention index information (Figure 4a). Notably, the Fiehn RI values from the Fiehn BinBase records were converted to Kovats RI values using a simple regression model constructed by analyzing a mixture of fatty acid methyl esters and alkanes. It should also be noted that the first layer of the InChIKey was used for structure definition, although some stereoisomers (e.g., glucose and galactose) can be distinguished by GC–MS. The Fiehn BinBase library was the top contributor among public resources (A in Figure 4a), containing a total of 259 unique compound records. In contrast, the GL Sciences (B) and Osaka University (D; our library) repositories contained 51 and 30 unique structures, respectively, and shared 96 unique structural records.
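Grouping records by the first layer of the InChIKey, as done for the laboratory comparison above, can be sketched as follows. The first block of an InChIKey (the 14 characters before the first hyphen) encodes atom and bond connectivity, so stereoisomers share it. The keys below are hypothetical placeholders in InChIKey layout, not real compounds, and the function names are illustrative.

```python
from collections import defaultdict


def first_block(inchikey):
    """Return the connectivity layer (first 14-character block) of an InChIKey."""
    return inchikey.split("-")[0]


def labs_per_skeleton(records):
    """Map each molecular skeleton to the set of laboratories that report it."""
    groups = defaultdict(set)
    for lab, inchikey in records:
        groups[first_block(inchikey)].add(lab)
    return dict(groups)


# Hypothetical (laboratory prefix, InChIKey) pairs; keys are placeholders.
records = [
    ("PR", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),    # one stereoisomer
    ("OUF", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),   # another stereoisomer, same skeleton
    ("GLS", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"),   # different skeleton
]
groups = labs_per_skeleton(records)
shared = {key for key, labs in groups.items() if len(labs) > 1}
```

Counting the skeletons reported by one laboratory only, or by several, gives the kind of subset relationship summarized in a Venn diagram.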


The results in Figure 4a motivated us to use the RI and EI–MS spectral records of the other institutions in addition to our library (330 unique structures), covering a total of 829 unique structures. Therefore, RI reproducibility was examined for 70 unique structures shared by five research institutes (Figure 4b), of which 14 sugar compounds were excluded due to difficulties in structure definition. This evaluation is important for understanding which RI tolerance is reliable in spectral searching. As a result, we found that the standard deviations among publicly available experimental RIs were less than 50 (0.5 carbon chain length difference in the Kovats index), except for the Kazusa DNA Research Institute. In fact, the commercial NIST database, which contains experimental RI values for over 40,000 unique structures, can also be used through the NIST MS search software, where the records from slightly polar columns such as DB-5 can be used, at least in our case. Next, we addressed retention index prediction in order to use not only the remaining 7,774 structure records in MassBank and MoNA, but also the information in the Wiley commercial library (Figure 4c). First of all, the regression model for RI prediction was constructed from our library containing 430 records. All structural data in our library were converted to MeOX and TMS derivatives using MetaboloDerivatizer, where reactive C=O moieties were converted to MeOX (the E/Z isomer was manually determined) and all acidic protons except those of primary amines were converted to TMS. For the 131 of the 430 records that contain a primary amine (–NH2) moiety, the conversion was performed with three options, i.e. from –NH2 to –N(TMS)2, to –NH–TMS, or to –NH2 (see Experimental section): a total of 612 derivative forms were created. After substantial manual effort, the 337 records for which the RI and derivative form pair could be reliably defined were used for further analysis (Supplementary Table S3).
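The bookkeeping behind a derivative form can be checked with a short mass calculation. The Python sketch below is not how MetaboloDerivatizer works (it rewrites SMILES codes through the Chemistry Development Kit); it only illustrates, from standard monoisotopic atomic masses, how each MeOX and TMS group shifts the mass of a metabolite. The function name `derivative_mass` is hypothetical.

```python
# Monoisotopic atomic masses (u).
H = 1.007825
C = 12.000000
N = 14.003074
SI = 27.976927

# TMS: one acidic H is replaced by Si(CH3)3, a net gain of C3H8Si.
TMS_SHIFT = SI + 3 * C + 9 * H - H
# MeOX: C=O becomes C=N-OCH3, a net gain of CH3N (the oxygen is retained).
MEOX_SHIFT = N + C + 3 * H


def derivative_mass(mono_mass, n_meox, n_tms):
    """Monoisotopic mass after adding n_meox methoxime and n_tms TMS groups."""
    return mono_mass + n_meox * MEOX_SHIFT + n_tms * TMS_SHIFT


GLUCOSE = 180.063388  # C6H12O6
glucose_meox_5tms = derivative_mass(GLUCOSE, n_meox=1, n_tms=5)
glucose_meox_4tms = derivative_mass(GLUCOSE, n_meox=1, n_tms=4)
```

Under these values the fully silylated glucose methoxime comes out near 569.288 u, and an incompletely silylated form differs from it by one TMS shift of about 72.040 u.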
Importantly, a wide variety of MeOX or TMS derivative forms can be observed in GC–MS experiments: even for glucose, where 5TMS is the most stable form, the 4TMS form is sometimes observed. Curation of these derivative isomers will facilitate reliable identification in GC–MS metabolomics, and GC–MS databases such as the Fiehnlib12 and Golm DB13 can assist such curation. A total of 4,566 descriptors were generated by PaDEL-Descriptor version 2.18 (ref 22). As a cross-validation approach, the RI and descriptor pairs were sorted by RI value, with odd and even ordering records used as Set 1 (169 pairs) and Set 2 (168 pairs), respectively. When Set 1 was used as the training set, Set 2 was used as the test set, and vice versa. The descriptors ATSc1 (ref 23), topoDiameter (ref 23), MLFER_L (ref 24), and ETA_Beta (ref 25) were commonly selected as the top four important variables in both models, i.e. Set 1 (training) → Set 2 (test) and Set 2 (training) → Set 1 (test). The coefficients and intercept are shown in Figure 4c. The standard deviation (SD) and R-square (R2) values were 78 and 0.93 for Set 1 → Set 2, and 88 and 0.93 for Set 2 → Set 1, respectively. Considering their chemical properties (see Supporting Information), the retention index (RI) of TMS-derivative compounds can be predicted from descriptors reflecting substructural repeatability (ATSc1), the gas–hexadecane interaction (MLFER_L), electronic state (ETA_Beta), and molecular size (topoDiameter), at least under our GC conditions (5% diphenyl and 95% dimethyl polysiloxane column with a linear gradient condition). The standard deviation of the prediction was less than 100 (one carbon chain length difference in the Kovats index), whereas that of experimental RI values reused across laboratories was less than 50. Although this accuracy is not sufficient for the ‘identification’ of metabolites based on retention index alone, the approach can be used as a ‘filter’ to exclude many false positive candidates, and its combination with EI–MS matching leads to the reliable annotation of unknown metabolites, as shown in the showcase of three metabolites. Moreover, this approach compensates for the lack of information on MeOX and TMS derivative forms in the NIST database, improving the accuracy of RI predictions for derivatized metabolites compared to the errors of 300–400 Kovats units (95% confidence interval) that are often generated by the NIST RI estimation program.14 We applied this regression model to 11,111 MassBank and 668,231 Wiley records not containing RI information. The derivatization was simply performed with the default setting of MetaboloDerivatizer, where the primary amine is converted to the 1TMS form (–NH–TMS): a total of 679,342 derivative forms were exported, and their retention indices were then predicted via the regression model.
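The odd/even split and the evaluation of a linear RI model can be sketched in Python as below. The study itself used R, and the coefficients of the final model are given only in Figure 4c, so the coefficient and descriptor values here are hypothetical placeholders. The error measure is assumed to be the root-mean-square residual, which may differ from the exact definition of SD used in the study.

```python
import math


def odd_even_split(pairs):
    """Sort (ri, descriptors) pairs by RI and alternate them between Set 1 and Set 2."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    return ordered[0::2], ordered[1::2]   # 1st, 3rd, ... and 2nd, 4th, ...


def predict_ri(descriptors, coefficients, intercept):
    """Linear model: intercept plus the weighted sum of the selected descriptors."""
    return intercept + sum(coefficients[name] * descriptors[name] for name in coefficients)


def residual_sd(observed, predicted):
    """Root-mean-square difference between observed and predicted RI values."""
    errors = [o - p for o, p in zip(observed, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# 337 hypothetical records, each with one made-up descriptor "x".
pairs = [(1000.0 + i, {"x": float(i)}) for i in range(337)]
set1, set2 = odd_even_split(pairs)

# Hypothetical model parameters (not the values of Figure 4c).
coefficients, intercept = {"x": 1.0}, 995.0
observed = [ri for ri, _ in set2]
predicted = [predict_ri(d, coefficients, intercept) for _, d in set2]
sd_set2 = residual_sd(observed, predicted)
```

Alternating records after sorting by RI gives both sets nearly the same RI range, which is why 337 records split into 169 and 168.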
Evaluation of false discovery rate
When using the expanded spectral library, the RI tolerance for searching structures must be widened for public experimental RIs (±50 RI) and for predicted RIs (±100 RI), in contrast to validated in-house RIs (less than 10 RI). Therefore, it was important to estimate the false discovery rate (FDR) in searching spectral databases under the wider RI tolerance setting. We used a total of 127 spectra identified in the Chinese medicine Senkyu as a validation set to evaluate FDRs. The RI and EI spectra of these compounds were manually validated by a mass spectrometry specialist against our in-house library. Furthermore, a total of 13,570 spectra containing all MassBank and MoNA records were used as the search space (Figure 4d). As a result, the FDR values when searching other experimental RIs (tolerance 50) and predicted RIs (tolerance 100) were estimated as 16.5% and 37.0%, respectively, when stereoisomers had to be distinguished. In contrast, these FDR values decreased for the determination of the molecular skeleton (atom and bond connectivity described by the first layer of the InChIKey) to less than 9.4% and 14.1% with 50 and 100 RI tolerances, respectively. This result suggested that analyzing authentic standards was essential, but that the efficiency in obtaining the correct structure was improved by this RI filtering compared to results without the RI criteria (19%). Furthermore, the computational cost (retrieval frequency of spectral records) was reduced: the total count of spectral search trials for 127 queries against 13,570 spectra using 50, 100, and infinite RI tolerances was approximately 100,000, 200,000, and 1,700,000, respectively.

Identification of three metabolites
Our four-step strategy was showcased in the identification of three new metabolites not stored in our own library and not reported for the Chinese medicine Senkyu (Figure 5). It should be noted that this strategy is not suitable for the identification of metabolites that are truly new compounds (unknown-unknowns), but rather for the practical annotation of unknown EI–MS spectra using spectral databases. We focused on three unknown EI–MS spectra, assigned as ID 301, 761, and 1538 in MS-DIAL. These unknowns were among the 170 unknown peaks obtained from PCA hypothesis testing. A total of 682,248 EI–MS spectra, including all private, public, and commercial records, were imported into the MS-FINDER spectral search engine. Note that the library contains EI–MS spectra not only from MeOX-TMS derivative forms, but also from other derivative types and from underivatized structures: in this study, we used all records because investigating the background information of over 680,000 records would be a tremendous task. The cut-off value for spectral similarity was set to 80%.
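The combination of a spectral similarity cut-off with an RI window can be sketched as a simple candidate filter. In this Python illustration, the record fields (`name`, `similarity`, `ri`, `ri_is_predicted`) and the candidate values are hypothetical. The default tolerances follow the values discussed for public experimental RIs (±50) and predicted RIs (±100), and are left as parameters because the appropriate window depends on the source of the RI.

```python
def filter_candidates(candidates, query_ri, min_similarity=0.80,
                      tol_experimental=50.0, tol_predicted=100.0):
    """Keep candidates passing the similarity cut-off and the RI window, best first."""
    kept = []
    for cand in candidates:
        if cand["similarity"] < min_similarity:
            continue  # spectral match too weak
        tolerance = tol_predicted if cand["ri_is_predicted"] else tol_experimental
        if abs(cand["ri"] - query_ri) <= tolerance:
            kept.append(cand)
    return sorted(kept, key=lambda cand: cand["similarity"], reverse=True)


# Hypothetical candidates for one unknown peak observed at RI 1500.
candidates = [
    {"name": "cand_A", "similarity": 0.92, "ri": 1510.0, "ri_is_predicted": False},
    {"name": "cand_B", "similarity": 0.88, "ri": 1585.0, "ri_is_predicted": True},
    {"name": "cand_C", "similarity": 0.95, "ri": 1750.0, "ri_is_predicted": True},
    {"name": "cand_D", "similarity": 0.70, "ri": 1502.0, "ri_is_predicted": False},
]
survivors = [c["name"] for c in filter_candidates(candidates, query_ri=1500.0)]
```

Tightening both tolerances to the same value, for example `tol_predicted=50.0`, reduces the candidate list further at the risk of discarding a correct structure whose RI was predicted with a larger error.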
Spectral searching gave a total of 46, 5, and 64 structure candidates for ID 301, 761, and 1538, respectively. The RI values for EI–MS records with no RI information were predicted as described above in Steps 3 and 4. The cut-off value for RI filtering was set to 50 (a difference of 0.5 carbon chain length in the Kovats index), at which the FDR value was estimated as 9.4% according to the previous section. With this RI filter applied in addition to spectral matching, the number of structure candidates decreased to 4, 2, and 5 for ID 301, 761, and 1538, respectively. Commercially available standard compounds were purchased, and the unknowns were finally identified as butane-1,2,3-triol, 3-deoxyglucosone, and palatinitol, respectively (Figure 5). Interestingly, palatinitol, which is known as a growth factor of Bifidobacterium in the human microbiome,26 was highly accumulated in Cnidium officinale Makino that had undergone a steaming process, which is recognized as the general product of the Chinese medicine Senkyu in Japan. As Senkyu is used to improve gastrointestinal function, this result suggested that palatinitol is the bioactive compound that improves the human microbiome by affecting Bifidobacterium growth. In contrast, butane-1,2,3-triol can be used as a quality marker to distinguish Senkyu imitations on the Japanese market. Importantly, these compounds were identified efficiently, at low cost and with little time and effort, by the four-step strategy. In fact, we also tried to annotate an additional 10 chromatographic peaks among the 170 biologically important unknown peaks, for which the top candidate suggested by MS-FINDER had greater than 90% spectral similarity to the reference spectrum. Of these, four peaks (ID 76, 297, 1216, and 1527), which were annotated as sugar-related structures, were investigated using authentic standard compounds. Unfortunately, the retention indices of the standards differed slightly from those of the unknown peaks. Nevertheless, these results suggest that our strategy led us efficiently to the annotation of novel metabolites, because the 'challenge' of authentic standard experiments succeeded with 43% accuracy (three of seven).
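The two-stage candidate reduction described in this section (a spectral similarity cut-off followed by an RI tolerance) can be sketched as follows. This is a minimal sketch under stated assumptions, not the MS-FINDER implementation: the `Candidate` class, the function names, and all candidate values are hypothetical, and the rule that a candidate without any RI is discarded is an illustrative choice.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    name: str
    similarity: float              # spectral similarity in [0, 1]
    library_ri: Optional[float]    # RI stored with the record, if any
    predicted_ri: Optional[float]  # RI predicted from the structure, if any

def reference_ri(c: Candidate) -> Optional[float]:
    """Prefer the RI stored with the record; fall back to the predicted RI."""
    return c.library_ri if c.library_ri is not None else c.predicted_ri

def filter_candidates(cands: List[Candidate], experimental_ri: float,
                      min_similarity: float = 0.80,
                      ri_tolerance: float = 50.0) -> List[Candidate]:
    """Keep candidates passing both the similarity and the RI criteria."""
    kept = []
    for c in cands:
        if c.similarity < min_similarity:
            continue  # fails the spectral similarity cut-off
        ri = reference_ri(c)
        if ri is None or abs(ri - experimental_ri) > ri_tolerance:
            continue  # no usable RI, or outside the RI tolerance
        kept.append(c)
    return kept

# Hypothetical candidates for an unknown peak observed at RI 1740.
candidates = [
    Candidate("cand_1", 0.91, None, 1730.0),    # predicted RI only, within 50
    Candidate("cand_2", 0.85, 1795.0, None),    # RI differs by 55
    Candidate("cand_3", 0.75, 1741.0, None),    # similarity below 80%
    Candidate("cand_4", 0.88, None, None),      # no RI available
    Candidate("cand_5", 0.82, 1690.0, 1800.0),  # stored RI is used, differs by 50
]
kept = filter_candidates(candidates, experimental_ri=1740.0)
print([c.name for c in kept])
```

Because an RI difference of 100 corresponds to one carbon in the n-alkane series that defines the Kovats index, the default tolerance of 50 corresponds to the 0.5 carbon chain length difference mentioned above.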


Conclusion

We have demonstrated an efficient four-step strategy using EI–MS databases for metabolite annotation of unknown EI–MS spectra. QC curve filtering and hypothesis testing in principal component analysis efficiently extracted reliable chromatographic peaks that could be considered intermediate and biologically important metabolites. The MS-FINDER spectral search engine contributed to identifying structure candidates on the basis of spectral similarity. Finally, the prediction of retention indices considerably improved the narrowing down of structure candidates. This strategy can serve as a guide for identifying unknown EI–MS spectra using publicly or commercially available spectral records, which together contain over 900,000 mass spectra.

Contributions

T.M., H.T., and E.F. designed the study. H.T. wrote the source code for the MS-FINDER spectral search engine, MetaboloDerivatizer, and the statistical analysis software. T.M. and H.T. manually confirmed the identification results of MS-DIAL. T.M. performed the experiments. T.M. and H.M. checked the derivative forms of the original structures included in our in-house database. H.T. performed the curation of the metabolome table, hypothesis testing, and FDR evaluation. T.M., H.T., and E.F. thoroughly discussed the project. T.M. and H.T. wrote the manuscript, and the other authors contributed to manuscript editing.

Acknowledgements

This work was supported by the AMED-Core Research for Evolutionary Science and Technology (AMED-CREST). H.T. was supported by a grant-in-aid for scientific research (C) 15K01812. The study represents a portion of the dissertation submitted by Teruko Matsuo to Osaka University in partial fulfillment of the requirements for her PhD.


References

1. Cajka, T.; Fiehn, O. Anal. Chem. 2016, 88, 524–545.
2. Tsugawa, H.; Cajka, T.; Kind, T.; Ma, Y.; Higgins, B.; Ikeda, K.; Kanazawa, M.; Vander, G. J.; Fiehn, O.; Arita, M. Nat. Methods 2015, 12, 523–526.
3. Johnson, C. H.; Ivanisevic, J.; Siuzdak, G. Nat. Rev. Mol. Cell Biol. 2016, 17, 451–459.
4. Tsugawa, H.; Tsujimoto, Y.; Arita, M.; Bamba, T.; Fukusaki, E. BMC Bioinf. 2011, 12, 131.
5. Lai, Z.; Fiehn, O. Mass Spectrom. Rev. 2016, 9999, 1–13.
6. Horai, H.; Arita, M.; Kanaya, S.; Nihei, Y.; Ikeda, T.; Suwa, K.; Ojima, Y.; Tanaka, K.; Tanaka, S.; Aoshima, K.; Oda, Y.; Kakazu, Y.; Kusano, M.; Tohge, T.; Matsuda, F.; Sawada, Y.; Hirai, M. Y.; Nakanishi, H.; Ikeda, K.; Akimoto, N.; Maoka, T.; Takahashi, H.; Ara, T.; Sakurai, N.; Suzuki, H.; Shibata, D.; Neumann, S.; Iida, T.; Tanaka, K.; Funatsu, K.; Matsuura, F.; Soga, T.; Taguchi, R.; Saito, K.; Nishioka, T. J. Mass Spectrom. 2010, 45, 703–714.
7. Stein, S. Anal. Chem. 2012, 84, 7274–7282.
8. Allen, F.; Pon, A.; Greiner, R.; Wishart, D. S. Anal. Chem. 2016, 88, 7689–7697.
9. Kovats, E. Helv. Chim. Acta 1958, 41, 1915–1932.
10. Lee, M. L.; Vasslaros, D. L. Anal. Chem. 1979, 51, 768–773.
11. Kind, T.; Wohlgemuth, G.; Lee, D. Y.; Lu, Y.; Palazoglu, M.; Shahbaz, S.; Fiehn, O. Anal. Chem. 2009, 81, 10038–10048.
12. Fiehn, O.; Wohlgemuth, G.; Scholz, M. Lect. Notes Comput. Sci. 2005, 3615, 224.
13. Kopka, J.; Schauer, N.; Krueger, S.; Birkemeyer, C.; Usadel, B.; Bergmüller, E.; Dörmann, P.; Weckwerth, W.; Gibon, Y.; Stitt, M.; Willmitzer, L.; Fernie, A. R.; Steinhauser, D. Bioinformatics 2005, 21, 1635–1638.
14. Stein, S. E.; Babushok, V. I.; Brown, R. L.; Linstrom, P. J. J. Chem. Inf. Model. 2007, 47, 975–980.
15. Kumari, S.; Stevens, D.; Kind, T.; Denkert, C.; Fiehn, O. Anal. Chem. 2011, 83, 5895–5902.
16. Want, E. J.; Masson, P.; Michopoulos, F.; Wilson, I. D.; Theodoridis, G.; Plumb, R. S.; Shockcor, J.; Loftus, N.; Holmes, E.; Nicholson, J. K. Nat. Protoc. 2013, 8, 17–32.
17. Yamamoto, H.; Fujimori, T.; Sato, H.; Ishikawa, G.; Kami, K.; Ohashi, Y. BMC Bioinf. 2014, 15, 51.
18. Kobayashi, S.; Nagasawa, S.; Yamamoto, Y.; Donghyo, K.; Bamba, T.; Fukusaki, E. J. Biosci. Bioeng. 2012, 114, 86–91.
19. Tsugawa, H.; Bamba, T.; Shinohara, M.; Nishiumi, S.; Yoshida, M.; Fukusaki, E. J. Biosci. Bioeng. 2011, 112, 292–298.
20. Wohlgemuth, G.; Haldiya, P. K.; Willighagen, E.; Kind, T.; Fiehn, O. Bioinformatics 2010, 26, 2647–2648.
21. Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. J. Chem. Inf. Comput. Sci. 2003, 43, 493–500.
22. Yap, C. H. J. Comput. Chem. 2011, 32, 1466–1474.
23. Todeschini, R.; Consonni, V. Molecular descriptors for chemoinformatics, Weinheim, 2nd ed.; Wiley: New York, 2009.
24. Platts, J. A.; Butina, D.; Abraham, M. H.; Hersey, A. J. Chem. Inf. Model. 1999, 39, 835–845.
25. Roy, K.; Ghosh, G. Internet Electron. J. Mol. Des. 2003, 2, 599–620.
26. van Weerden, E. J.; Huisman, J. Br. J. Nutr. 1993, 69, 455–466.


Figure legends

Figure 1. Four-step strategy for the identification of unknown EI–MS spectra. Step 1: After the raw MS dataset was processed, the metabolome table was curated by the QC curve filter, which removed procedural artifacts and chromatographic noise. Step 2: Hypothesis testing in principal component analysis was used to obtain biologically meaningful chromatographic peaks. Step 3: EI–MS database-oriented structure elucidation was performed based on spectral similarity matching. Step 4: After retention index predictions were generated by multiple regression analysis, most false positive candidates were excluded by RI filtering. Finally, commercially available and/or synthesized compounds were analyzed to validate the metabolite annotations.

Figure 2. Quality control calibration curve filter. (a) Example of a chromatographic feature passing the QC curve filter. The ion abundances of alignment ID 378 rose with increasing QC concentration, so the feature was recognized as an intermediate metabolite. (b) Example of a chromatographic feature not passing the QC curve filter. The feature was recognized as an artifact of GC–MS analysis because the ion abundances of alignment ID 497 were nearly equal across the QC dilution series.

Figure 3. Variable selection in principal component analysis. (a) Principal components 1 and 2 were interpreted as the axes showing the differences between C. officinale and L. chuanxiong, and between the production areas (Japan or China), respectively. Abbreviations: CJSD, C. officinale, Japan, Steaming and Dry; CJD, C. officinale, Japan, Dry only; CCSD, C. officinale, China, Steaming and Dry; CCD, C. officinale, China, Dry only; LCSD, L. chuanxiong, China, Steaming and Dry; LCD, L. chuanxiong, China, Dry only. (b) Loading plot. The p-values of hypothesis testing were corrected using Bonferroni's method; significant chromatographic features are indicated by solid black circles.

Figure 4. Database curation for MassBank and MoNA. (a) Venn diagram showing overlaps among published structures containing retention index (RI) information, created using the first layer of the InChIKey. 'Others' denotes MassBank or MoNA records not containing RI information. (b) Reproducibility of RI values, examined using the overlapping (ABCDE) metabolites. The y-axis shows the RI difference between Osaka University (D) and the others. The standard deviation (SD) of the RI differences was also calculated. (c) A total of 337 RI and descriptor pairs was divided into two sets (Set 1 and Set 2), with one used as the training set and the other as the test set: Set 1 → Set 2 means that Sets 1 and 2 were used as the training and test sets, respectively. The mathematical equation for RI prediction is shown at the bottom, together with the R squared and SD values. (d) False discovery rate (FDR) evaluated from 127 identified spectra, in combination with publicly available EI–MS records. The x- and y-axes represent the RI tolerance and the FDR value, respectively. Black and red lines are the identification results for full-layer and first-layer matching of the InChIKey, respectively.

Figure 5. Identification of three metabolites via spectral searching and RI filtering. EI–MS spectra of three biologically significant chromatographic peaks were subjected to candidate searching in the MS-FINDER software with an 80% cut-off for spectral similarity. The structure candidates were further refined by RI filtering, which reduced the number of structure candidates for ID 301, 761, and 1538 from 46, 5, and 64 after spectral searching to 4, 2, and 5, respectively. Identification was finally performed using commercially available authentic standard compounds.


Supporting information

Supporting Information Available. SI manuscript: (1) legends of supplementary figures and tables, (2) details of samples and experimental procedures, (3) details of noise reduction using the calibration curve of the dilution series of QC samples, and (4) discussion of the chemical properties used in retention index predictions. Supplementary figures and tables: Figure S1, Table S1, Table S2, Table S3. This material is available free of charge via the Internet at http://pubs.acs.org.

[Figure 1 (graphic). Workflow from a GC-MS dataset to a metabolome table (by MS-DIAL etc.), followed by: Step 1, artifact exclusions by QC curve filtering (ion abundance vs. QC dilutions; Peak X, R² = 0.9972, TRUE; Peak Y, R² = 0.2749, FALSE); Step 2, variable selection by hypothesis testing; Step 3, searching the EI-MS spectral database (unknown spectrum searched against ~900,000 EI-MS records, giving candidates A, B, C, D, ...); Step 4, retention index filtering by RI prediction (predicted RI vs. experimental RI with a confidence interval; candidate B highly suggested); and identification by authentic standard experiment.]

[Figure 2 (graphic). (a) Alignment ID 378, RI 1356, quant mass 188: ion abundance vs. retention time (min) for the QC series (0 to 60), and ion abundance vs. volume (mg), R² = 0.9635. (b) Alignment ID 497, RI 1490, quant mass 86: ion abundance vs. retention time (min) for the QC series (0 to 60), and ion abundance vs. volume (mg), R² = 0.0395.]
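The QC curve filter illustrated in Figure 2 can be sketched as a linear regression of ion abundance on QC level, keeping a feature only when the fit is good and the slope is positive. This is a minimal sketch, not the authors' implementation: the R² threshold of 0.9, the positive-slope requirement, the function names, and all abundance values are assumptions chosen for illustration.

```python
def fit_line(x, y):
    """Return (slope, r_squared) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # degenerate input: no trend can be estimated
    # For simple linear regression, R squared equals the squared correlation.
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

def passes_qc_filter(levels, abundances, min_r2=0.9):
    """True if abundance rises with QC level and the linear fit is good."""
    slope, r2 = fit_line(levels, abundances)
    return slope > 0 and r2 >= min_r2

# Nominal QC levels loosely modeled on the 0 to 60 series in the figure;
# the abundance values (arbitrary units) are made up.
levels = [0, 10, 20, 30, 40, 50, 60]
rising = [0.1, 1.0, 2.1, 2.9, 4.2, 5.0, 6.1]  # tracks the QC level
flat = [5.1, 4.8, 5.3, 4.9, 5.2, 5.0, 4.7]    # independent of the QC level

print(passes_qc_filter(levels, rising))  # feature behaves like a metabolite
print(passes_qc_filter(levels, flat))    # feature behaves like an artifact
```

A feature whose abundance does not change across the dilution series, like the `flat` example, is treated as an artifact in the same spirit as alignment ID 497 in the legend.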

[Figure 3 (graphic). (a) Score plot, PC1 (29.4%) vs. PC2 (15.8%), with sample groups CJSD, CJD, CCSD, CCD, LCSD, and LCD. (b) Loading plot, PC1 (29.4%) vs. PC2 (15.8%).]

[Figure 4 (graphic). (a) Venn diagram of structures with RI information from A: Fiehn BinBase (1,021 records, 503 unique structures), B: GL Sciences (494 records, 380 unique structures), C: Kazusa DNA (273 records, 150 unique structures), D: Osaka Univ. (430 records, 330 unique structures), and E: RIKEN (241 records, 182 unique structures), plus 'Others'. (b) Kovats RI difference from the MassBank Osaka Univ. RI, plotted per compound name for each source, with the standard deviation (SD). (c) Predicted Kovats RI vs. experimental Kovats RI for Set 1 → Set 2 and Set 2 → Set 1 (R² = 0.93 in every panel; SD = 78 to 88), with the equation
Predicted RI = -495.2*ATSc1 + 29.8*topoDiameter + 101.4*MLFER_L + 27.9*ETA_Beta + 388.7
(d) False discovery rate (%) vs. RI tolerance for absolute stereo and molecular skeleton (first layer of InChIKey) matching; 127 identified spectra were searched against 13,570 spectral records at various RI tolerances.]
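The regression equation shown in Figure 4c can be written as a small function. This is a minimal sketch: the coefficients and descriptor names are taken from the figure, but the function name is hypothetical, the descriptor values in the example are made up, and computing real descriptor values for a structure requires a separate descriptor calculator that is not shown here.

```python
# Coefficients of the multiple regression model as printed in Figure 4c.
COEFFICIENTS = {
    "ATSc1": -495.2,
    "topoDiameter": 29.8,
    "MLFER_L": 101.4,
    "ETA_Beta": 27.9,
}
INTERCEPT = 388.7

def predict_ri(descriptors):
    """Predict a Kovats RI from a dict of the four molecular descriptors."""
    missing = [name for name in COEFFICIENTS if name not in descriptors]
    if missing:
        raise KeyError(f"missing descriptors: {missing}")
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )

# Made-up descriptor values, used only to exercise the arithmetic.
example_descriptors = {
    "ATSc1": 0.5,
    "topoDiameter": 10,
    "MLFER_L": 8.0,
    "ETA_Beta": 12.0,
}
ri = predict_ri(example_descriptors)
print(round(ri, 1))
```

Given the SD of roughly 80 reported in the figure, a predicted value of this kind is better suited to excluding distant candidates than to confirming a single structure.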

[Figure 5 (graphic). For each unknown peak: ion abundance across the sample groups (CJSD, CJD, CCSD, CCD, LCSD, LCD), the number of candidates after spectral searching (>80%) and after RI filtering, and the measured and reference EI–MS spectra (relative abundance vs. m/z) with the derivatized structure. Alignment ID 761 (retention index 1740, quant mass 231): 5 candidates reduced to 2; 3-deoxyglucosone, predicted RI 1730, InChIKey ZGCHLOWZNKRZSN-NTSWFWBYSA-N. Alignment ID 1538 (retention index 2875, quant mass 204): 64 candidates reduced to 5; palatinitol, predicted RI 2894, InChIKey SERLAGPUMNYUCK-DCUALPFSSA-N. A further panel starts from 46 candidates after spectral searching (>80%).]