Article
Deep MS/MS-Aided Structural-similarity Scoring for Unknown Metabolites Identification
Hongchao Ji, Yamei Xu, Hongmei Lu, and Zhimin Zhang
Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.8b05405 • Publication Date (Web): 16 Apr 2019
Analytical Chemistry
Deep MS/MS-Aided Structural-similarity Scoring for Unknown Metabolite Identification
Hongchao Ji, Yamei Xu, Hongmei Lu*, Zhimin Zhang*
College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, PR China
ABSTRACT: Tandem mass spectrometry (MS/MS) is the workhorse for structural annotation of metabolites because it provides abundant structural information. Currently, metabolite identification mainly relies on querying experimental spectra against public or in-house spectral databases, so identification is severely limited by the spectra available in those databases. Although the metabolome consists of a huge number of functionally different metabolites, the whole metabolome derives from a limited number of initial metabolites via bioreactions. In each bioreaction, the reactant and the product often differ in some substructures but remain structurally related. These structurally related metabolites often have related MS/MS spectra, which makes it possible to identify unknown metabolites through known ones. However, it is challenging to explore the internal relationship between MS/MS spectra and structural similarity. In this study, we present the Deep-learning-based approach for MS/MS-Aided Structural-similarity Scoring (DeepMASS), which scores the structural similarity of an unknown metabolite against a known one from their MS/MS spectra with deep neural networks. We evaluated DeepMASS with leave-one-out cross-validation on the MS/MS spectra of 662 compounds in KEGG and with an external test on the biomarkers from a male infertility study measured on Shimadzu LC-ESI-IT-TOF and Bruker Compact LC-ESI-QTOF instruments. The results show that the identification of an unknown compound is valid if a structurally related metabolite is available in the database. DeepMASS provides an effective approach to extend the identification range of existing MS/MS databases.
Tandem mass spectrometry (MS/MS) is widely used for metabolite identification because MS/MS spectra contain abundant substructure information on the metabolites. Current metabolite identification mainly relies on searching MS/MS spectra against databases such as HMDB1, METLIN2, GNPS3 and MassBank. However, the number of available spectra is limited compared with the number of potential metabolites. Moreover, the MS/MS spectra of the same metabolite may vary with the type of mass spectrometer, the collision energy and other factors (such as collision gas, residence time in the traps, ion-molecule reactions, ion-source contamination, ion optics, etc.4,5), which makes identification even more challenging.
To overcome this limitation, a large number of methods and tools have emerged recently. They can generally be divided into three categories: (1) in-silico fragmentation methods, which include MetFrag6,7, CFM-ID8, MAGMa+9, MIDAS10,11 and MS-Finder12; (2) fingerprint-based methods, which include FingerID13, CSI-FingerID14 and the IOKR version of CSI-FingerID15 and use the MS/MS spectrum to predict the molecular fingerprint for assisting identification; (3) known-to-unknown methods, which compare the MS/MS spectra of unknown metabolites with the relevant spectra of reference metabolites and predict their structural similarity for identification. Schymanski et al.16 summarized the results of the Critical Assessment of Small Molecule Identification (CASMI) 2016. The participants included CSI-FingerID, MS-Finder, CFM-ID and MAGMa+. CSI-FingerID had the best Top 1 result (34%; correct candidate ranked in first place), followed by MS-Finder (22%) and CFM-ID (19%). However, CFM-ID had the most correct candidates in the Top 10 (59%), which is better than MS-Finder (48.5%) and CSI-FingerID (48.1%). Blaženović et al. tested the pure in-silico fragmentation performance of MetFrag, CFM-ID, MAGMa+ and MS-Finder with the CASMI 2016 challenges. MetFrag had the best Top 10 performance (54.8%), followed by CFM-ID (54.5%) and MAGMa+ (48.4%)17.
In this study, we focus on the methods of the third category. Although the metabolome consists of a huge number of functionally different metabolites, the metabolites basically originate from a limited number of initial metabolites via bioreactions. For example, Wang et al. summarized 36 common biological transformation rules of KEGG. These rules can occur alone or in combination in a bioreaction18. In many reactions, the structure of the reactant changes only partially. The reactant and the product still share some common substructures and are structurally related. These structurally related metabolites pass their similarities on to their MS/MS spectra19. It is therefore possible to exploit those similarities to identify structurally related metabolites: the unknown metabolite can be identified based on the known ones, and the identification ability of existing spectral databases can be strengthened through this approach.
ACS Paragon Plus Environment
Structural similarity has been applied to assist metabolite identification in previous works18,20–26. However, these methods are still not as integrated and mature as the software of categories 1 and 2 mentioned above. To determine structural similarity, one can compare the MS/MS spectra and decide whether they are similar. Each MS/MS spectrum can be regarded as a vector of m/z-intensity pairs. Therefore, the dot product, cosine similarity, model-based methods and rule-based methods can be used to perform the comparison. For example, MetDNA20,26 uses the dot product of the MS/MS spectra of two metabolites as the criterion to estimate whether the two metabolites are neighbors in the metabolic reaction network. iMet23 uses the cosine similarity between MS/MS spectra at various collision energies and the mass difference of the precursor ions as features, then builds a random forest model to predict whether the unknown metabolite can be transformed from a known metabolite. However, both dot product and cosine similarity can only capture a simple linear relationship between MS/MS spectra, while the other information on substructure transformation is lost. MPEA18 summarized some common transformations from known bioreactions. It then takes the metabolites identified with confidence as the seed metabolites. Next, the MS/MS spectra are analyzed manually to confirm whether the unknown metabolites are products of the seed metabolites via bioreactions. The criterion is the ratio of the fragment ions of the unknown metabolite that can be explained. These fragment ions include IF (identical fragments), DF (fragments that deliver the same mass difference as the pseudo-molecular ions) and NLF (fragments of product and precursor metabolites with identical neutral losses from the pseudo-molecular ions) ions. However, this method lacks universality and automation.
To improve the automation and generalization ability of the known-to-unknown methods, novel machine learning methods are needed that can learn features directly from raw data, generalize better than rule-based approaches and scale easily to take advantage of growing MS/MS spectral databases. Recently, the deep learning revolution has changed the machine learning community fundamentally27. Various architectures of deep neural networks have emerged for different applications, including deep neural networks28, convolutional neural networks29, recurrent neural networks30, deep belief networks31 and generative adversarial networks32. Meanwhile, deep learning frameworks such as Tensorflow33 and Keras34 enable researchers to go from idea to result swiftly. Therefore, deep learning has quickly become a revolutionary technology in many research fields and has achieved state-of-the-art performance in metabolomics35,36, proteomics37–40 and the exploration of structural relationships of compounds41–43.
Here, we present the Deep-learning-based approach for MS/MS-Aided Structural-similarity Scoring (DeepMASS) between the unknown metabolite and the known ones. The source code is available at https://github.com/hcji/DeepMASS. With the assistance of the similar known metabolites predicted by DeepMASS, one can elucidate the possible structures of the unknown metabolites accordingly.
METHODS
Overview of DeepMASS. The workflow of DeepMASS consists of three main steps (Figure 1): training the DeepMASS model, predicting structural-similarity scores, and retrieving and ranking the candidates. Each step is described in detail in the following sections.
Train DeepMASS Model. The core of DeepMASS is to score the structural similarity between metabolites based on their MS/MS spectra with a deep neural network. The problem is defined as:
$$\text{MASS Score}_{a,b} = f(\mathit{spectrum}_a, \mathit{spectrum}_b) \qquad (1)$$
where a and b are two metabolites (denoted as a metabolite pair), and $\mathit{spectrum}_a$ and $\mathit{spectrum}_b$ are their MS/MS spectra (denoted as a spectra pair). If the spectra in a spectra pair are obtained by mass spectrometers, it is defined as an experimental spectra pair. If the spectra in a spectra pair are generated by in-silico fragmentation tools, it is defined as a theoretical spectra pair. The structural similarity between the two metabolites can be measured by the Dice correlation of their molecular fingerprints (FP score)44, as in equation 2:
$$\text{FP Score}_{a,b} = \mathit{Dice}(\mathit{fingerprint}_a, \mathit{fingerprint}_b) \qquad (2)$$
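As a minimal sketch of the FP score in equation 2, the Dice coefficient can be computed on the sets of on-bit indices of two binary fingerprints. The paper computes the fingerprints with RDKit; the bit sets below are made-up toy values, and `dice_similarity` is a hypothetical helper name rather than part of DeepMASS.

```python
def dice_similarity(bits_a, bits_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        # Assumed convention for two empty fingerprints (not from the paper).
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))


# Toy on-bit indices, for illustration only (not real Morgan fingerprints).
fingerprint_a = {1, 5, 9, 12}
fingerprint_b = {1, 5, 7, 12, 20, 33}

# Three shared bits out of 4 + 6 set bits: 2 * 3 / 10.
fp_score = dice_similarity(fingerprint_a, fingerprint_b)
print(round(fp_score, 3))
```

The score is 1 for identical bit sets and 0 for disjoint ones, which matches the 0 to 1 range of FP scores discussed in the Results.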
There are different types of fingerprints and correlation functions. In this work, we compared the performance of three types of fingerprints (Topological45, MACCS46 and Morgan47,48), and the Morgan fingerprint was finally chosen because it gave the best correlation between FP score and MASS score (see the Results section). All procedures were implemented with the Python version of RDKit (www.rdkit.org).
To obtain metabolite pairs with high structural similarity, we downloaded the list of reactant-product pairs from the KEGG database (named positive metabolite pairs). Each pair includes the reactant and product of a known biochemical reaction. Generic and symbolic reactions were discarded. The remaining positive metabolite pairs are listed in Table S1 (8439 pairs in total). Then, three times as many random metabolite pairs were generated as negative metabolite pairs from all the metabolites in KEGG; they are listed in Table S2. The positive and negative metabolite pairs were merged to train and validate the DeepMASS model.
Experimental MS/MS spectra were cloned from the repository of the MetDNA project at https://github.com/omicschina/quBing. This repository includes the MS/MS spectra of 752 KEGG compounds, all measured on a Sciex TripleTOF. For each positive or negative metabolite pair, the spectra of both metabolites were searched in the experimental spectral database. If both were found, they were collected as an experimental spectra pair. In this way, 2,827 experimental spectra pairs were constructed from the 752 experimental MS/MS spectra. Since the number of experimental MS/MS spectra is too limited to train the DeepMASS model effectively, CFM-ID8 was used to generate an MS/MS spectral database of all KEGG compounds (15,300 spectra in total). We then collected the theoretical spectra pairs in the same way as the experimental spectra pairs, and 32,341 theoretical spectra pairs were constructed in this study (a few spectra pairs were discarded because the generation of the theoretical spectra of several metabolites failed).
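The construction of negative metabolite pairs described above can be sketched as follows. This is an illustrative sketch only: the function name, the fixed seed, the toy compound identifiers and the choice to exclude known positive pairs from the random draw are assumptions, since the paper states only that three times as many random pairs were generated.

```python
import random


def sample_negative_pairs(compound_ids, positive_pairs, ratio=3, seed=0):
    """Draw ratio * len(positive_pairs) distinct random pairs.

    Known positive pairs are skipped; this exclusion is an assumption of
    this sketch, not something stated in the paper.
    """
    rng = random.Random(seed)
    compound_ids = list(compound_ids)
    positives = {frozenset(pair) for pair in positive_pairs}
    n_target = ratio * len(positive_pairs)

    # Guard against asking for more distinct pairs than can exist.
    n = len(compound_ids)
    if n_target > n * (n - 1) // 2 - len(positives):
        raise ValueError("not enough compounds for the requested number of pairs")

    negatives = set()
    while len(negatives) < n_target:
        pair = frozenset(rng.sample(compound_ids, 2))
        if pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Toy identifiers in KEGG style, for illustration only.
compounds = ["C%05d" % i for i in range(1, 11)]
positive_pairs = [("C00001", "C00002"), ("C00003", "C00004")]
negative_pairs = sample_negative_pairs(compounds, positive_pairs)
print(len(negative_pairs))
```

Storing each pair as a `frozenset` makes the pair order irrelevant, so (a, b) and (b, a) are treated as the same pair.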
Theoretical spectra pairs are used to pretrain the deep neural network. Then, experimental spectra pairs are used to fine-tune it. This pretrain and fine-tune technique is a feasible way to overcome the small number of experimental spectra pairs when training a deep neural network. The DeepMASS model for structural-similarity scoring was built with the Keras framework on top of the Tensorflow backend with GPU acceleration. The architecture of the deep neural network is illustrated in Figure 2. Raw MS/MS spectra were converted to sparse vectors with an m/z accuracy of 0.01. The sparse vectors of the metabolite pair and their fast Fourier transform cross-correlations are combined as the input of the deep neural network, which is followed by a fully connected layer with the rectified linear unit (ReLU), a nonlinear activation function defined by:
$$\mathit{ReLU}(x) = \max(0, x) \qquad (3)$$
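The spectrum vectorization and the ReLU of equation 3 can be sketched in a few lines. The sketch below assumes that peaks falling into the same 0.01-wide bin are summed and that intensities are scaled so the base peak equals 1; neither detail is specified in the paper, and the function names and peak values are hypothetical.

```python
def spectrum_to_sparse(peaks, bin_width=0.01):
    """Map (m/z, intensity) peaks to a sparse {bin index: relative intensity} vector."""
    binned = {}
    for mz, intensity in peaks:
        index = int(round(mz / bin_width))
        # Assumption: intensities of peaks in the same bin are summed.
        binned[index] = binned.get(index, 0.0) + intensity
    base_peak = max(binned.values(), default=0.0)
    if base_peak <= 0.0:
        return {}
    # Assumption: scale to the base peak so all values lie in [0, 1].
    return {index: value / base_peak for index, value in binned.items()}


def relu(x):
    """Rectified linear unit, equation 3."""
    return max(0.0, x)


# Made-up peak list: the first two peaks fall into the same 0.01 bin.
peaks = [(136.0618, 50.0), (136.0623, 50.0), (428.0367, 25.0)]
print(spectrum_to_sparse(peaks))
print(relu(-2.0), relu(3.5))
```

A dictionary keeps only the occupied bins, which is the practical meaning of "sparse" here: at a bin width of 0.01 a dense vector over a typical m/z range would be almost entirely zeros.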
ReLU alleviates the vanishing gradient problem and allows deeper neural networks to be trained effectively. The exact masses of the metabolite pair, their mass difference and the chemical formulas are concatenated as the second input, which is also followed by a fully connected layer with the ReLU activation function. The outputs are then concatenated and followed by two fully connected layers for structural-similarity scoring. After training, the DeepMASS model can predict the MASS score for any experimental spectra pair.
Predict MASS Scores. Given the MS/MS spectrum of the unknown metabolite, experimental spectra pairs are built by pairing it with every MS/MS spectrum in the database (752 spectra in the MetDNA project). The DeepMASS model predicts the MASS scores for all the experimental spectra pairs of this unknown metabolite. The metabolites with MASS scores over a threshold are chosen as reference metabolites. The default threshold is 0.6. These metabolites may have structures similar to that of the unknown. Finally, all the MASS scores of these reference metabolites are stored in the MASS score vector.
Retrieve and Rank Candidates. The procedure for retrieving and ranking candidates is as follows:
(1) Determine the molecular formula. With the exact monoisotopic mass in MS, formula candidates can be retrieved from the refined formula database12 or generated with a formula generator49. The theoretical isotope pattern of each formula is generated by IsoSpec50. The theoretical isotope patterns are compared with the experimental isotope pattern of the unknown metabolite to obtain its most probable formula.
(2) Retrieve the structure candidates. With the formula, the structure candidates can be retrieved from the comprehensive metabolome structure database12 or from PubChem via the PUG REST web service51.
(3) Rank the structure candidates. Calculate the FP scores between each candidate and the reference metabolites. As a result, a vector of FP scores is obtained for each candidate, and the length of each vector is the number of reference metabolites. Then, the dot product between the MASS score vector and the FP score vector of each candidate is calculated. A higher dot product means better consistency. All the structure candidates are ranked by their dot product values in descending order.
$$\text{Dot Product} = \sum_{i=1}^{n} \text{MASS Score}_i \times \text{FP Score}_i \qquad (4)$$
where n is the number of reference metabolites.
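The reference selection and the ranking of equation 4 can be sketched as below. The metabolite names and all score values are invented for illustration, and the two helper functions are hypothetical names that are not taken from the DeepMASS source code.

```python
def select_references(mass_scores, threshold=0.6):
    """Keep database metabolites whose predicted MASS score is over the threshold."""
    return {name: score for name, score in mass_scores.items() if score > threshold}


def rank_candidates(reference_scores, fp_scores):
    """Rank candidates by sum_i MASS Score_i * FP Score_i (equation 4), descending."""
    ranked = []
    for candidate, fps in fp_scores.items():
        dot = sum(reference_scores[ref] * fps[ref] for ref in reference_scores)
        ranked.append((candidate, dot))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Invented MASS scores of the unknown against three database metabolites.
mass_scores = {"ref_A": 0.9, "ref_B": 0.7, "ref_C": 0.3}
references = select_references(mass_scores)  # ref_C falls below 0.6

# Invented FP scores of two structure candidates against each database metabolite.
fp_scores = {
    "cand_1": {"ref_A": 0.8, "ref_B": 0.5, "ref_C": 0.9},
    "cand_2": {"ref_A": 0.4, "ref_B": 0.9, "ref_C": 0.1},
}
for candidate, dot in rank_candidates(references, fp_scores):
    print(candidate, round(dot, 2))
```

Only metabolites that pass the threshold enter the sum, so a candidate's high FP score against a discarded metabolite (ref_C here) does not affect its rank.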
EXPERIMENT
Seminal Dataset. This dataset is from our previous male infertility study. A brief introduction is included in the Supporting Information. The selected biomarkers were measured with a Shimadzu LC-ESI-IT-TOF-MS instrument and a Bruker LC-ESI-QTOF-MS instrument, respectively. These biomarkers were identified by comparison with the standard spectra from the HMDB database.
CASMI 2014 Dataset. The MS/MS spectra of the first 10 challenges from the CASMI 2014 dataset, which were directly detected from human tissues, were used in this study. They are compounds from human endothelial cells, urine, cerebrospinal fluid, serum and plasma, measured on a Thermo Scientific Orbitrap Velos and a Thermo Scientific Q-Exactive.
RESULTS
Structural Similarities of KEGG Pairs. The basic hypothesis of DeepMASS is that reactant and product metabolites share similar substructures. If the unknown metabolite is the transformation product of known metabolites, there will be metabolites in the existing database that are similar to the unknown metabolite. To verify the hypothesis, we compared the structural similarities of positive metabolite pairs and negative metabolite pairs. Figure 3A shows the results. The structural similarities of positive metabolite pairs are significantly higher than those of negative metabolite pairs. Figure 3B shows the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate of a binary classification based on the structural similarity. The area under the curve (AUC) is 0.94, which indicates that structural similarity differentiates the positive metabolite pairs well from the negative metabolite pairs. Therefore, we deduced that most transformation product metabolites have structures similar to those of their reactant metabolites.
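For readers who want to reproduce this kind of check, the AUC can be computed directly as the probability that a randomly chosen positive pair has a higher FP score than a randomly chosen negative pair. The sketch below uses this pairwise definition with invented scores; it is not the authors' evaluation code, and for pair lists as large as those in Tables S1 and S2 a rank-based implementation would be faster.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as P(positive score > negative score), counting ties as one half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented FP scores of positive (reactant-product) and negative (random) pairs.
positive_fp = [0.9, 0.8, 0.6, 0.4]
negative_fp = [0.5, 0.3, 0.2, 0.1, 0.4, 0.7]
print(round(roc_auc(positive_fp, negative_fp), 3))
```

An AUC of 0.5 corresponds to no separation between the two groups and 1.0 to perfect separation.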
Spectral Cross Correlation Between Metabolites. iMet and MetDNA use the simple cosine value and dot product to evaluate the correlation of spectra, which can only capture a simple linear relationship. In fact, this simple linear relationship is effective only under limited conditions, and the biotransformation of metabolites often leads to deviations from it. In Figure 4, A1 shows the spectra of nicotinamide adenine dinucleotide phosphate (NADP) and nicotinamide adenine dinucleotide (NAD), respectively. The spectra are very similar because of the structural similarity, and the main fragment ions are identical. A2 shows the cross-correlation coefficient between the spectra. The maximum occurs at an m/z shift of 0. In this case, the cosine value and dot product can describe the relationship between the spectra. B1 shows the spectra of 5'-phosphoribosylformylglycinamidine and aminoimidazole ribotide. Their structures are clearly similar, since the latter originates from dehydration of the former. However, their spectra share only one identical fragment ion, which means that the simple linear correlation is not suitable in this situation. The maximum cross-correlation coefficient occurs at an m/z shift of 18.01 Da, which matches the exact mass of H2O. Thus, DeepMASS takes both the spectra pair and the cross-correlation coefficients as inputs to train the model, which achieves better structural-similarity scoring of metabolites.
Accuracy of MASS Scores. To evaluate the scoring accuracy, 20% of the theoretical spectra pairs and of the experimental spectra pairs were excluded for testing. Figure 5 plots the MASS scores against the FP scores of the excluded theoretical spectra pairs and experimental spectra pairs, respectively. In both plots, the FP scores of the points are distributed over the full range from 0 to 1, which means the data used are representative for the regression task.
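The m/z-shift behavior described in the Spectral Cross Correlation paragraph above can be illustrated with a small sketch. It computes an un-normalized cross-correlation by direct summation over sparse binned spectra, whereas the paper computes cross-correlations with the fast Fourier transform; the spectra are synthetic (a toy precursor and a copy shifted by a water loss), and the function names are hypothetical.

```python
from collections import defaultdict


def cross_correlation(spec_a, spec_b):
    """Un-normalized cross-correlation of two sparse spectra, keyed by bin shift.

    A shift s pairs bin i of spec_a with bin i - s of spec_b.
    """
    corr = defaultdict(float)
    for i, intensity_a in spec_a.items():
        for j, intensity_b in spec_b.items():
            corr[i - j] += intensity_a * intensity_b
    return dict(corr)


def best_shift(spec_a, spec_b, bin_width=0.01):
    """Return the m/z shift (in Da) at which the cross-correlation is largest."""
    corr = cross_correlation(spec_a, spec_b)
    return max(corr, key=corr.get) * bin_width


# Synthetic spectra on a 0.01 m/z grid: three fragments of spec_b sit
# 1801 bins (18.01 Da) below those of spec_a, and one ion is shared.
spec_a = {10000: 1.0, 15000: 0.5, 20000: 0.8}
spec_b = {8199: 1.0, 13199: 0.5, 18199: 0.8, 10000: 0.2}

print(cross_correlation(spec_a, spec_b).get(0, 0.0))  # weak overlap at zero shift
print(best_shift(spec_a, spec_b))                     # close to 18.01
```

At zero shift only the single shared ion contributes, which is why a plain dot product scores this pair as dissimilar, while the peak of the cross-correlation recovers the mass difference of the transformation.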
The MASS scores are well correlated with the FP scores. Each point is a metabolite pair, and most of the points lie near the diagonal. The mean absolute errors of the two plots are 0.073 and 0.082, which indicates that the structural similarities predicted by DeepMASS are close to the real structural similarities of the metabolite pairs. In particular, the points with higher MASS scores are more concentrated around the diagonal, which indicates higher accuracy. It is therefore reasonable to select the metabolites with higher MASS scores as reference metabolites for assisting identification. The mean absolute errors of the models trained with the other types of fingerprints mentioned in the Methods section are listed in Table S7.
Cross Validation Test. To validate the identification performance of DeepMASS, we ran a cross-validation test. In this test, 662 spectra in the measured spectral database were selected because their compound structures are included in the comprehensive metabolome structure database12 (321,616 metabolites in total, abbreviated as Metabolite Database). Each spectrum was treated as unknown once, and the rest were treated as known metabolites. We followed the steps described in the previous sections. Since the MS information was unavailable, we assumed the formulas were given and tested the structural identification performance of DeepMASS. For 590 of the 662 spectra, at least one metabolite with a similar structure was found. The identification results based on the structural similarity are summarized in Table 1, and the details are listed in Table S3. For these 590 spectra, the correct structure was ranked as the top hit in 52.0% of the cases and was found among the top 3, top 5 and top 10 hits in 74.9%, 85.3% and 92.0% of the cases, respectively. To increase the challenge, we extended the candidate search to the entire PubChem compound database without any filter criteria. The detailed results are listed in Table S4. Even in this setting, DeepMASS ranked the correct structure among the top 5 hits in 55.4% of the cases and among the top 10 and top 20 hits in 62.4% and 72.4% of the cases. For comparison, we also used MetFrag and CFM-ID for the same identification tasks; the details are in the Supporting Information. The results show that DeepMASS achieved better performance than MetFrag and CFM-ID in this cross-validation test with both the Metabolite Database and the PubChem database.
External Data Test. The spectra of the selected biomarkers were processed by DeepMASS.
To guarantee that the results reflect the identification performance for unknown metabolites, the spectrum of a biomarker was removed from the database if it was already present. For comparison, we also used MS-Finder, MAGMa, MetFrag and CFM-ID for the same identification tasks. The parameter settings of the compared programs are given in the Supporting Information. The results listed in Table 2 show that DeepMASS performs much better than MetFrag, CFM-ID and MAGMa on all datasets. MS-Finder shows performance similar to that of DeepMASS. However, MS-Finder relies on some prior information, such as the citations in public databases, whereas DeepMASS does not rely on any prior information. Although prior information can improve performance in some cases, it can also introduce bias. The results indicate that the predictions are reasonable even when the spectra were generated by different types of instruments and from different types of biological samples.
To look into the detailed basis of the candidate ranking of DeepMASS, we chose the seminal plasma biomarkers measured by LC-ESI-IT-TOF as an example and listed the top 4 reference metabolites and their FP scores in Table S5. In most cases, the given reference metabolites have structures similar to those of the true metabolites. This means the reference metabolites do provide substructure information on the unknown metabolite, which explains why DeepMASS can rank the true metabolite as the top hit in most cases even though the metabolite is not included in the training set. The true metabolite is ranked as the second hit for creatine and arginyl-valine, because the top candidate given by DeepMASS has a structure very similar to that of the true metabolite.
To demonstrate that DeepMASS is effective on mass spectrometers from different manufacturers, the MS/MS spectra of the first 10 challenges from the CASMI 2014 dataset, which were directly detected from human tissues, were used in this study. They are compounds from human endothelial cells, urine, cerebrospinal fluid, serum and plasma, measured on a Thermo Scientific Orbitrap Velos and a Thermo Scientific Q-Exactive. The results are listed in Table S6 and show that DeepMASS is effective on a variety of instruments.
Evaluation of the Confidence of Identification. Since computational methods cannot always rank the true metabolite as the top hit, one important and challenging problem is how to evaluate the confidence of the putative identification52. For DeepMASS, the confidence evaluation is straightforward because of its simple and explicit theory. The criteria are as follows. First, the number of reference metabolites and the FP scores between the reference metabolites and the structural candidates. In general, more reference metabolites and higher FP scores indicate higher confidence. The reason is that each structurally related reference metabolite carries some substructure information on the unknown metabolite. The more reference metabolites there are, the more substructure information can be captured.
Furthermore, a higher similarity value means that the substructure information is more convincing. Second, the dot product between the MASS scores and the FP scores. A higher dot product means more confidence in the identification result. When the dot product between the MASS scores and the FP scores is very low, the prediction of the DeepMASS model is not accurate, and the result is not convincing. Table 3 shows two examples. The top 3 structurally related metabolites and the rank of the true metabolite are listed for each. For each reference metabolite, its FP score and MASS score are also listed. In the top row, the reference metabolites given by DeepMASS are more similar to the top candidate, and the predicted MASS scores are more consistent with the FP scores. Therefore, the putative identification is of high confidence, since the consistency between the MASS scores and the FP scores is good. In the second row, however, the predicted MASS scores are lower and the consistency between the MASS scores and the FP scores is worse, so the result is of low confidence.
Limitation and Future Work. Since DeepMASS utilizes the structural similarities between the known metabolites and the unknown metabolite, the number and quality of the existing known metabolites and their MS/MS spectra are crucial for the accuracy of identification. Currently, the DeepMASS model is built on an MS/MS spectral library with only 752 metabolites, yet its identification accuracy is comparable to that of state-of-the-art methods. It provides a novel idea for extending the range of metabolites that can be identified with an existing database. What restricts the accuracy of DeepMASS at present is the number of structurally related metabolite pairs and the size of the spectral database used for training. The accuracy can be improved by including more standard spectra and structurally related metabolite pairs, as the DeepMASS model can easily be trained with a larger dataset. In the future, we will train a larger-scale DeepMASS model by fusing MS/MS spectra from public databases such as HMDB, METLIN, GNPS and MassBank. We will also collect more structurally related metabolite pairs from other bioreaction databases, such as Rhea53, MetRxn54 and BKM-react55, to train a more accurate model. DeepMASS is also provided as an open-source package so that users can train their own models on their in-house databases.
CONCLUSION
Metabolite identification is a great challenge in metabolomic studies. One of the main reasons is the limited number of MS/MS spectra available in databases. The structural similarity between metabolites is a promising concept to increase the number of metabolites that can be identified with existing databases. Although some previous works have provided algorithms based on this concept, none of them is integrated enough to rank the potential structures automatically. Moreover, they cannot handle deviations from the simple linear relationship between the spectra of structurally related metabolites. Hence, we present DeepMASS, an effective method to score the structural similarities between the unknown metabolite and the metabolites in databases from their MS/MS spectra and to obtain the reference metabolites. It can also retrieve structural candidates from public compound databases and rank them with the assistance of the reference metabolites. Compared with MetFrag, CFM-ID and MAGMa, DeepMASS shows better performance in metabolite identification without any prior information. Since the principle of DeepMASS is simple and the ranking criterion is straightforward, users can evaluate the confidence of an identification easily. DeepMASS provides both a novel method and an open-source tool for the identification of unknown metabolites. Since the MS/MS spectra of the same metabolite may vary with the type of mass spectrometer and the collision energy, users can also train their own models with the provided source code.
Supporting information
Details of the MetFrag comparison in the cross-validation test; brief introduction of the infertility study; details of the comparison in the external data test; reason for the selection of CFM-ID for generating the theoretical MS/MS spectra (PDF); positive metabolite pairs used in this work (Table S1); negative metabolite pairs used in this work (Table S2); identification results of the cross-validation test (Table S3); details of the cross-validation test with the PubChem database (Table S4); identification results of the infertility plasma biomarkers (Table S5); identification results of the CASMI 2014 dataset (Table S6); benchmark of different types of molecular fingerprints (Table S7).
AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected]; [email protected].
Notes
ACS Paragon Plus Environment
Page 7 of 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
The authors declare no competing financial interest.
ACKNOWLEDGMENTS
This work was financially supported by the National Natural Science Foundation of China (Grant Nos. 21873116 and 21675174).
Table 1. Identification results on the cross-validation test.

Database             Method     Top 1    Top 5    Top 10   Top 20
Metabolite database  DeepMASS   52.00%   85.30%   92.00%   95.60%
Metabolite database  MetFrag    40.60%   72.80%   81.90%   90.50%
Metabolite database  CFM-ID     48.64%   72.20%   82.03%   90.68%
PubChem              DeepMASS   29.50%   55.40%   62.40%   72.40%
PubChem              MetFrag    7.60%    19.20%   25.10%   35.60%
PubChem              CFM-ID     9.84%    23.4%    31.4%    43.6%
Table 2. Identification results on the external data test.

Instrument  Metabolite      DeepMASS  MS-Finder  MAGMa  MetFrag  CFM-ID
IT-TOF      Arginine        1         1          1      3        1
IT-TOF      Creatine        2         1          3      4        1
IT-TOF      Histidine       1         1          1      1        1
IT-TOF      Lysine          1         1          2      1        7
IT-TOF      Phenylalanine   1         2          16     7        2
IT-TOF      Tyrosine        1         1          3      2        3
IT-TOF      Arginyl-Valine  2         2          1      2        2
Q-TOF       Arginine        1         1          1      1        4
Q-TOF       Creatine        2         1          1      2        1
Q-TOF       Histidine       1         1          1      3        1
Q-TOF       Lysine          1         1          5      4        1
Q-TOF       Phenylalanine   1         1          6      2        5
Q-TOF       Tyrosine        1         1          1      3        1
Q-TOF       Arginyl-Valine  2         3          2      3        2
Table 3. Evaluation of the confidence of putative identification results of DeepMASS.

Top candidate         True metabolite    Structure-related metabolites (Top 3)   Confidence
ATP (1/7)             ATP                0.658/0.789, 0.807/0.786, 0.935/0.785   High
Diethanolamine (5/6)  Reduced Threonine  0.212/0.600, 0.139/0.615, 0.187/0.587   Low
Figure 1: Workflow of the DeepMASS method. MS/MS spectra are retrieved from the database and divided into spectra pairs, which are then transformed into vectors. A deep neural network is trained on these vectors to score structural similarity from the spectra. The MS/MS spectrum of an unknown is paired with all known spectra, and the structural similarities between the unknown and the known metabolites are predicted (MASS scores). Meanwhile, structural candidates are retrieved from a structure database, and the structural similarity between each candidate and the known metabolites is calculated by fingerprint correlation (FP scores). For each candidate, the dot product of its FP scores and the MASS scores is calculated, and the structural candidates are ranked by their dot products in descending order.
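The final ranking step of this workflow can be illustrated with a minimal Python sketch. It assumes that the MASS scores of the unknown against the known metabolites and the FP scores of each candidate against the same known metabolites are already available; the function name `rank_candidates`, the candidate names, and all numbers are hypothetical and are not taken from the DeepMASS source code.

```python
from typing import Dict, List, Sequence, Tuple


def rank_candidates(
    mass_scores: Sequence[float],
    fp_scores: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Rank candidates by the dot product of their FP scores with the MASS scores.

    mass_scores[i] is the predicted similarity between the unknown and known
    metabolite i; fp_scores[name][i] is the fingerprint similarity between the
    candidate `name` and the same known metabolite i.
    """
    ranked = []
    for name, fps in fp_scores.items():
        if len(fps) != len(mass_scores):
            raise ValueError(f"score length mismatch for {name}")
        dot = sum(f * m for f, m in zip(fps, mass_scores))
        ranked.append((name, dot))
    # Highest dot product first (descending order).
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical scores against three known metabolites (illustration only).
mass_scores = [0.9, 0.2, 0.1]
fp_scores = {
    "candidate_A": [0.8, 0.3, 0.2],
    "candidate_B": [0.1, 0.7, 0.6],
}
ranking = rank_candidates(mass_scores, fp_scores)
```

In this toy input, the candidate whose fingerprint similarities are high for the same known metabolites that received high MASS scores is ranked first.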
Figure 2: Architecture of the deep neural network of DeepMASS
Figure 3: (A) Box-whisker plots comparing the FP scores of positive metabolite pairs and negative metabolite pairs; (B) Receiver operating characteristic (ROC) curve for binary classification of positive metabolite pairs and negative metabolite pairs based on their FP score.
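As a rough illustration of panel (B), the area under a ROC curve for a score-based binary classification can be computed from the pairwise comparison of positive and negative scores. The sketch below is an assumption-laden example: the function name `roc_auc` and the FP scores are hypothetical, and it is not the evaluation code used in this work.

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive pair has a higher
    score than a randomly chosen negative pair, with ties counted as one half.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical FP scores of positive and negative metabolite pairs.
positive_fp = [0.9, 0.8, 0.6, 0.4]
negative_fp = [0.5, 0.3, 0.2, 0.1]
auc = roc_auc(positive_fp, negative_fp)
```

An AUC close to 1 means the FP score separates the two classes well, whereas 0.5 corresponds to random guessing.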
Figure 4: MS/MS spectra of metabolites with similar structures (A1, B1) and their cross-correlation coefficients (A2, B2). In A1 and A2, the spectra of NAD and NADP show a simple linear correlation, as the structures of the two metabolites are nearly identical except for one phosphate group; the cross-correlation coefficient reaches its maximum at an m/z shift of 0. In B1 and B2, the MS/MS spectra deviate from the simple linear correlation by an m/z shift of 18.01 Da, which reflects the dehydration cyclization; the cross-correlation therefore reaches its maximum at an m/z shift of 18.01 Da. Thus, cross-correlation is a better way to capture the relationship between the spectra of a metabolite pair.
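The idea of locating the m/z shift at which two spectra correlate best can be sketched in Python as follows. This is only an illustration under stated assumptions: the bin width of 0.01 Da, the maximum shift of 50 Da, the normalization, the function names, and the toy peak lists are all hypothetical and do not reproduce the spectra in this figure or the DeepMASS implementation.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def bin_spectrum(peaks: List[Peak], bin_width: float) -> Dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins (sparse representation)."""
    binned: Dict[int, float] = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return dict(binned)


def best_shift(
    spec_a: List[Peak],
    spec_b: List[Peak],
    bin_width: float = 0.01,   # assumed bin width in Da
    max_shift: float = 50.0,   # assumed largest |m/z shift| considered, in Da
) -> Tuple[float, float]:
    """Return (m/z shift, normalized cross-correlation) at the maximum.

    The shift is defined as m/z in spec_b minus m/z in spec_a.
    """
    bins_a = bin_spectrum(spec_a, bin_width)
    bins_b = bin_spectrum(spec_b, bin_width)
    norm = math.sqrt(sum(v * v for v in bins_a.values())) * math.sqrt(
        sum(v * v for v in bins_b.values())
    )
    if norm == 0.0:
        raise ValueError("spectra must contain non-zero intensities")
    max_bins = round(max_shift / bin_width)
    corr: Dict[int, float] = defaultdict(float)
    # Only peak pairs contribute, so accumulate the products per shift.
    for i, ia in bins_a.items():
        for j, ib in bins_b.items():
            shift = j - i
            if abs(shift) <= max_bins:
                corr[shift] += ia * ib / norm
    if not corr:
        return 0.0, 0.0
    best = max(corr, key=corr.get)
    return best * bin_width, corr[best]


# Toy spectra: spectrum_b is spectrum_a with every peak shifted by 18.01 Da.
spectrum_a = [(100.00, 1.0), (150.00, 0.5), (200.00, 0.8)]
spectrum_b = [(118.01, 1.0), (168.01, 0.5), (218.01, 0.8)]
shift, score = best_shift(spectrum_a, spectrum_b)
```

For these toy spectra, a plain correlation at zero shift finds no overlapping peaks, while the cross-correlation peaks at the 18.01 Da offset, which is the behavior the caption describes for panels B1 and B2.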
Figure 5: Plots of MASS scores against FP scores of metabolite pairs. The left plot shows the results for spectra generated by CFM-ID, and the right plot shows the results for experimental spectra. Both are from the test split used to evaluate the trained model. The high correlation between the FP scores and the MASS scores indicates that the model predicts the structural similarity between metabolites well.
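The agreement between MASS scores and FP scores can be quantified with a correlation coefficient. The sketch below assumes the Pearson coefficient as the measure, which the caption does not specify; the function name `pearson_r` and the score values are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical FP scores and predicted MASS scores for four metabolite pairs.
fp = [0.1, 0.4, 0.6, 0.9]
mass = [0.2, 0.35, 0.65, 0.85]
r = pearson_r(fp, mass)
```

A value of r near 1 would correspond to points lying close to a rising straight line in plots such as these.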