Article
Evaluation of an Artificial Neural Network Retention Index Model for Chemical Structure Identification in Non-Targeted Metabolomics Milinda A. Samaraweera, L. Mark Hall, Dennis W Hill, and David F. Grant Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.8b03118 • Publication Date (Web): 15 Oct 2018 Downloaded from http://pubs.acs.org on October 24, 2018
Analytical Chemistry
Evaluation of an Artificial Neural Network Retention Index Model for Chemical Structure Identification in Non-Targeted Metabolomics

Milinda A. Samaraweera†, L. Mark Hall‡, Dennis W. Hill†, David F. Grant*,†

†Department of Pharmaceutical Sciences, University of Connecticut, 69 N Eagleville Road, Storrs, Connecticut 06269, United States
‡Hall Associates Consulting, 2 Davis Street, Quincy, Massachusetts 02170, United States

ABSTRACT: Liquid chromatography coupled with electrospray ionization tandem mass spectrometry (LC-ESI-MS/MS) is a major analytical technique used for non-targeted identification of metabolites in biological fluids. Typically, in LC-ESI-MS/MS-based, database-assisted structure elucidation pipelines, the exact mass of an unknown compound is used to mine a chemical structure database to acquire an initial set of possible candidates. Subsequent matching of the collision-induced dissociation (CID) spectrum of the unknown to the CID spectra of candidate structures facilitates identification. However, this approach often fails because of the large numbers of potential candidates (i.e., false positives) for which CID spectra are not available. To overcome this problem, CID fragmentation prediction programs have been developed, but these also have limited success if large numbers of isomers with similar CID spectra are present in the candidate set. In this study, we investigated the use of a retention index (RI) predictive model as an orthogonal method to help improve identification rates. The model was used to eliminate candidate structures whose predicted RI values differed significantly from the experimentally determined RI value of the unknown compound. We tested this approach using a set of ninety-one endogenous metabolites and four in-silico CID fragmentation algorithms: CFM-ID, CSI:FingerID, Mass Frontier, and MetFrag. Candidate sets obtained from PubChem and the Human Metabolome Database (HMDB) were ranked with and without RI filtering followed by in-silico spectral matching. Upon RI filtering, twelve of the ninety-one metabolites were eliminated from their respective candidate sets; i.e., they were scored incorrectly as negatives.
For the remaining seventy-nine compounds, we show that RI filtering eliminated an average of 58% from PubChem candidate sets. This resulted in an approximately 2-fold improvement in average rankings when using CFM-ID, Mass Frontier, and MetFrag. In addition, RI filtering slightly increased the occurrence of number one rankings for all 4 fragmentation algorithms. However, RI filtering did not significantly improve average rankings when HMDB was used as the candidate database, nor did it significantly improve average rankings when using CSI:FingerID. Overall, we show that the current RI model incorrectly eliminated more true positives (12) than were expected (4-5) based on the filtering method. However, it slightly improved the number of correct first place rankings and improved overall average rankings when using CFM-ID, Mass Frontier, and MetFrag.
The composition and concentrations of small molecule metabolites (20-1000 Da) in living organisms frequently change over time and represent the biochemical phenotype of an individual. These changes can be due to multiple factors including diet, time of day, environmental exposures, disease, drug exposures, genetic manipulation, gender and age.1-2 Non-targeted metabolomics is the unbiased quantification and identification of these metabolites in biological samples.3 In non-targeted metabolomics, researchers often utilize liquid chromatography coupled with electrospray ionization mass spectrometry (LC-ESI-MS) as the major analytical technique to separate and accurately measure the precursor masses of thousands of metabolites present in biological samples.4 However, to elucidate the chemical structure of these compounds, tandem mass spectrometry (LC-ESI-MS/MS) is often used.5 In LC-ESI-MS/MS, an experimental collision-induced dissociation (CID) spectrum of an isolated precursor ion is used as a fingerprint in matching against a collection of reference CID spectra of known compounds in spectral libraries (e.g., MassBank, Metlin).6-8 Unfortunately, this approach often fails due to the dependency of CID spectral profiles on experimental conditions and the lack of coverage, within existing spectral libraries, of the chemical space that pertains to endogenous human metabolites.6, 9-11 To overcome this limitation, computational fragmentation software (predictive fragmenters) has been developed with the aim of predicting experimental tandem mass spectral (MS/MS) profiles and the chemical structures of the ensuing predicted fragments.10, 12

The working principles of these predictive fragmenters are described elsewhere; thus, only a brief overview is given here.13 Commercial predictive fragmenters such as ACD/MS Fragmenter (Advanced Chemistry Development Labs, www.acdlabs.com) and Mass Frontier (Thermo Scientific, www.thermoscientific.com) rely on general ionization, fragmentation and rearrangement rules, along with fragmentation schemes collected from the literature, to predict the chemical structures of energy-induced fragment ions generated from precursor ions of specific structural composition (https://tools.thermofisher.com).14 These tools are extremely helpful in aiding manual spectral interpretation, which can be labor-intensive.15 However, cost and the lack of automated candidate retrieval and ranking protocols limit their use in high-throughput metabolomics pipelines.16-17 In contrast, predictive fragmenters such as MetFrag and MAGMa attempt to explain ion peaks in an experimental CID spectrum by systematic dissociation of all the bonds in a given molecule. In other words, these predictive fragmenters compute all possible fragments of a molecule and then compare the masses of these fragments with the m/z values of fragments in an experimental CID spectrum. Compounds are ranked by assigning a score that is a function of ion peak intensity, the number of peaks explained in the experimental CID spectrum, bond dissociation energies and neutral losses to account for rearrangements.10, 18-19 Free availability, automated workflows and faster processing times make these programs popular among metabolomics researchers.20

Machine learning (ML) is one of the most rapidly growing areas in computer science.21 ML involves the development of computer algorithms that learn from example data or past experience to solve or predict the outcome of an unfamiliar problem.22 Predictive fragmenters such as CFM-ID and CSI:FingerID have been developed based on ML paradigms.20, 23 CFM-ID23 utilizes a stochastic, generative Markov model trained using the CID mass spectral profiles of approximately 3500 metabolites randomly chosen from the Metlin database.23 This method allows for the prediction of CID spectral profiles and can also be used to rank candidates based solely on the similarity between predicted and experimental spectra.23 CSI:FingerID is based on fragmentation trees and kernel-based support vector machines trained to predict molecular structural features from CID spectra.20, 24-25 The predicted set of molecular features (or a fingerprint) is used to rank candidates based on maximum likelihood considerations and Platt probabilities to refine the fingerprint similarity scoring.20 It is important to note that the training data used in ML methods have a significant influence on identification quality. Generally, ML-based predictive fragmenters outperform other competing predictive fragmenters but have longer processing times.13, 20, 26

Upon completion of an LC-ESI-MS/MS analysis, measured experimental features such as monoisotopic mass (MIM), retention time and CID spectra are often utilized by researchers to elucidate the chemical structure of an unknown compound (peak) of interest. In database-assisted structure elucidation pipelines, the first step is the acquisition of candidate structures for the unknown by matching the measured MIM to compounds in an all-purpose chemical database such as PubChem or ChemSpider, or in specialized biological databases such as the Human Metabolome Database (HMDB) or the Kyoto Encyclopedia of Genes and Genomes (KEGG).13, 27-28 According to the Critical Assessment of Small Molecule Identification (CASMI) contests,26, 29 the use of specialized databases improves the chances of correctly identifying the unknown structure. However, a major disadvantage of this approach is the inherent incompleteness of such databases: if the unknown compound is not contained in the database, it cannot be correctly identified.30 Thus, it can be argued that there are advantages in mining large chemical databases such as PubChem (which currently contains more than 90 million compounds) or ChemSpider (which currently contains more than 59 million compounds) because of their much larger size and the inclusion of more diverse compound classes. Nonetheless, mining such databases may also result in large candidate sets.31-32 For example, searching for 5-hydroxytryptophan (MIM = 220.0848 Da, ± 5 ppm) using the MetFrag web interface (http://msbi.ipbhalle.de/MetFragBeta/) yields 8223 candidates from PubChem and 3777 from ChemSpider. The same search resulted in 3 candidates from HMDB and 4 candidates from KEGG. Generating the candidate list from large databases increases the likelihood that the unknown will be included in the candidate list, but also dramatically increases the number of false positives.

To address this problem, we have developed a software package called MolFind,33 which relies on a set of orthogonal experimental features acquired from LC-ESI-MS/MS experiments. These include Retention Index (RI), Drift Index (DI) and Ecom50 (the collision gas normalized energy required to fragment 50% of precursor ions). For each experimental feature, MolFind eliminates false positives by comparing the experimental value of the unknown to a value predicted for each candidate compound using a computational model. A candidate compound is excluded when the predicted value for the candidate is substantially different from the experimental value of the unknown.33 In previous studies, when both RI and Ecom50 filters were applied concurrently with MetFrag, the ranking of the correct compound improved from 142 to 102 on average (over 35 sets).33 More importantly, it was suggested that enhancements in the accuracy of the RI and Ecom50 models could lead to the removal of 87.2% of candidates on average, for a potential average MetFrag ranking of 15.5.33
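The two steps described above, retrieving candidates whose MIM lies within a ppm tolerance of the measured value and then discarding candidates whose predicted RI is far from the experimental RI, can be illustrated with a minimal Python sketch. This is not the MolFind implementation: the `Candidate` class, the candidate names, masses and RI values, and the fixed symmetric RI window of 100 RIU are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str            # hypothetical label, not a real database entry
    mim: float           # monoisotopic mass in Da
    predicted_ri: float  # RI from some predictive model (made-up values)


def ppm_window(mim, ppm):
    """Return the (low, high) mass bounds for a relative tolerance in ppm."""
    delta = mim * ppm * 1e-6
    return mim - delta, mim + delta


def mass_match(candidates, measured_mim, ppm):
    """Keep candidates whose MIM lies inside the ppm window."""
    low, high = ppm_window(measured_mim, ppm)
    return [c for c in candidates if low <= c.mim <= high]


def ri_filter(candidates, experimental_ri, window):
    """Keep candidates whose predicted RI is within +/- window of the
    experimental RI. A single fixed window is an assumption made here."""
    return [c for c in candidates
            if abs(c.predicted_ri - experimental_ri) <= window]


# Hypothetical candidate pool for an unknown measured at 220.0848 Da.
pool = [
    Candidate("A", 220.0848, 450.0),
    Candidate("B", 220.0852, 900.0),
    Candidate("C", 220.0900, 455.0),
]

matched = mass_match(pool, measured_mim=220.0848, ppm=5.0)
survivors = ri_filter(matched, experimental_ri=460.0, window=100.0)
print([c.name for c in matched], [c.name for c in survivors])
```

In this toy pool, candidate C falls outside the 5 ppm mass window and candidate B is removed by the RI filter, leaving only A for subsequent spectral ranking.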
The robustness of modeling approaches in predicting experimental features is heavily dependent on the training set. The models used in our earlier study30 were far from optimal, as the Ecom50 and RI models were trained using 54 and 400 compounds, respectively, and had 99.5% confidence intervals of 2.1 eV and 114 RI units (RIU), respectively.30 Recently, we improved our RI model by training it with a set of synthetic chemicals (1955 in total) covering diverse chemical classes representative of endogenous human metabolites.34 Confirmed endogenous human metabolites were deliberately excluded from the model data set in that previous study; a total of 202 confirmed metabolites were set aside as an independent validation set to test whether a model based exclusively on relatively simple synthetic compounds could be used to make predictions for more complex metabolites.34 For the model used in the present study, the 202 independent validation compounds were reintroduced and the model was rebuilt using the same descriptors and learning protocol as the previous study, resulting in a model based on 2157 compounds. This was done to take advantage of the available data on confirmed metabolites and to expand the model applicability domain. As an extension of our previous work, the present study investigates the improvement in identification quality obtained by enrichment of candidate sets using the expanded 2157-compound RI model in conjunction with four different fragmentation algorithms: CFM-ID, CSI:FingerID, Mass Frontier and MetFrag. Candidate compounds were taken from both a large (PubChem) and a small (HMDB) database.
EXPERIMENTAL SECTION

Reagents and Chemicals. Acetonitrile (HPLC, gradient grade) and methanol (HPLC grade) were purchased from Sigma-Aldrich (St Louis, MO, USA). Reagent grade water (18.2 MΩ·cm), used for the UPLC mobile phase and sample preparation, was generated on a Barnstead Nanopure Diamond system (Thermo Scientific, Ward Hill, MA, USA). Heptafluorobutyric acid (HPLC grade) was purchased from Thermo Fisher Scientific Chemicals Inc. (Ward Hill, MA, USA). n-Propionamide, n-butanamide, and n-hexanamide were ordered from Aldrich (St Louis, MO, USA). n-Pentanamide was ordered from MP Biomedicals, LLC (Solon, OH, USA). A series of n-C7-C14 amides were synthesized as described in Supporting Information (SI) SI-1. The 91 test compounds and the controls used in the study were purchased from various sources; the vendor information is summarized in SI Table S2. HPLC grade formic acid (98%-100%) was purchased from EMD Millipore Corporation (Billerica, MA, USA).

Sample Preparation. Two different approaches were followed for sample preparation. The chemicals ordered from IROA Technologies were supplied in polypropylene well plates containing 5 µg of each chemical; 100 µL of solvent (0.1% formic acid in water, 0.05% formic acid in water:methanol (1:1) (v:v), or methanol, chosen based on the compound's XLOGP3 value taken from PubChem) was added to each well. The plates were covered with sealing tape to prevent evaporation. Dissolution was achieved by shaking the well plates on an Innova 2100 platform shaker (New Brunswick, CT, USA) for 45 minutes. Finally, the dissolved chemicals were transferred to 2 mL HPLC vials with micro volume glass inserts (Thermo Fisher Scientific, Ward Hill, MA, USA), sealed with Teflon septum caps and used directly for UPLC analysis. Stock solutions of all other chemicals were prepared at 1-10 µmol/mL concentrations in the appropriate solvent based on the analyte’s XLOGP3 value as described above. The prepared stock solutions were further diluted to appropriate concentrations and used for the UPLC analysis.

UPLC-ESI-MS/MS. Retention Index values were measured on a Zorbax SB-C18, 2.1 mm x 150 mm, 1.8 µm column (Agilent Technologies, Santa Clara, CA, USA) using an Acquity UPLC liquid chromatographic system (Waters, Milford, MA, USA). Solvent A was 0.766 mM heptafluorobutyric acid (HFBA) in water and solvent B was 0.766 mM HFBA in 10% water/acetonitrile (v/v). Compounds were eluted from the column using a solvent program consisting of a 4 min isocratic hold of 2% solvent B followed by a 20 min linear gradient to 100% solvent B and a 5 min isocratic hold, at a flow rate of 388 µL/min. The RI model used in this study was trained and validated using retention data for compounds analyzed on an Agilent 1100 capillary HPLC system (Agilent Technologies, Santa Clara, CA, USA). Thus, the protocols used to transfer the LC method to the Acquity UPLC system and to validate it are summarized in SI, SI-2. The outlet of the UPLC system was connected to the electrospray ionization (ESI) source of a Synapt G2-Si mass spectrometer (Waters, Milford, MA, USA) operating in the positive ion mode. A solution of leucine enkephalin (556.2771 Da, 400 pg/µL) in 0.1% (v/v) formic acid/methanol:water (1:1) was infused as the lock mass reference compound at a flow rate of 5 µL/min. The retention times of the test compounds were measured in duplicate using the detection parameters described in SI-2. The mass range of the detector was set to 20-1000 Da. CID was carried out with nitrogen as the collision gas, and the collision energy was varied from 0-30 eV (30-60 eV if required) in incremental steps of 2 eV at a scan rate of 12.5 scans/sec.

Retention Index Measurements. RI values were measured based on a method developed by Hill et al.35 At the beginning and the end of each run, 1 µL of a homologous series of n-C3-C14 amides was injected on the described UPLC system. The average retention times (RT) of the individual n-amides were used as the calibration reference for calculating RI values for the test compounds. The RI of each n-amide was defined as 100 times its number of carbon atoms. Owing to the linear relationships between RI and log RT in the isocratic part, and between RI and RT in the gradient part, of the solvent program, RI values of compounds eluted during the isocratic and gradient parts were calculated using the following equations, respectively:
$$RI_{\mathrm{isocratic}} = \frac{100\,(\log T_x - \log T_z)}{\log T_{z+1} - \log T_z} + 100z \tag{1}$$

$$RI_{\mathrm{gradient}} = \frac{100\,(T_x - T_z)}{T_{z+1} - T_z} + 100z \tag{2}$$
where $T_x$ is the retention time of the analyte, $T_z$ is the retention time of the n-amide eluting just before the analyte, $T_{z+1}$ is the retention time of the n-amide eluting just after the analyte, and $z$ is the number of carbon atoms in the n-amide eluting just before the analyte.

Compound Set Diversity. A set of ninety-one endogenous metabolites, not included in the training set of the RI model, was selected for the study. The set consisted of a variety of chemical classes including alkaloids, amines, benzene and substituted derivatives, benzenoids, carbohydrates and conjugates, carboxylic acids, fatty acyls, flavin nucleotides, imidazopyrimidines, indoles, lipids and lipid-like molecules, morphinans, organic acids, organic carbonic acids, organic nitrogen compounds, organic phosphonic acids, organoheterocyclic compounds, pteridines and derivatives, purine nucleosides, pyridines, quinolines, sphingolipids, steroids, stilbenes, and tetrahydroisoquinolines. Four of the compounds had an overall charge of +1 while the rest were neutral. The XLOGP3 values (predicted octanol-water partition coefficient) of the compound set varied from -5.0 to 8.5 (Figure 1) with a standard deviation of 3.4. The MIMs of the compounds varied from 105 Da to 785 Da with a standard deviation of 99 Da. Structural details for each test compound can be found in SI Table S3.
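As an illustration of Equations 1 and 2, the following Python sketch computes an RI value by locating the two n-amide standards that bracket the analyte. The function name, the calibration retention times and the `isocratic` flag are hypothetical; in particular, the caller is assumed to know whether the analyte eluted in the isocratic or the gradient part of the solvent program, and the calibration series is assumed to contain consecutive carbon numbers with strictly increasing retention times.

```python
import math
from bisect import bisect_right


def retention_index(t_x, amide_rts, isocratic):
    """Compute RI from analyte retention time t_x (min).

    amide_rts maps the carbon number of each n-amide standard to its
    average retention time. Equation 1 (log scale) is used when
    isocratic is True, Equation 2 (linear scale) otherwise.
    """
    carbons = sorted(amide_rts)
    times = [amide_rts[c] for c in carbons]

    # Index of the last standard eluting at or before the analyte.
    i = bisect_right(times, t_x) - 1
    if i < 0 or i >= len(times) - 1:
        # Before the first standard, or at/after the last one:
        # no bracketing pair exists, so no RI is computed.
        raise ValueError("analyte elutes outside the calibrated range")

    z, t_z, t_z1 = carbons[i], times[i], times[i + 1]
    if isocratic:
        frac = (math.log10(t_x) - math.log10(t_z)) / (
            math.log10(t_z1) - math.log10(t_z))
    else:
        frac = (t_x - t_z) / (t_z1 - t_z)
    return 100.0 * frac + 100.0 * z


# Made-up calibration times, for illustration only.
calibration = {3: 2.0, 4: 4.0, 5: 8.0}
print(retention_index(3.0, calibration, isocratic=False))
print(retention_index(math.sqrt(8.0), calibration, isocratic=True))
```

Both calls place the analyte halfway between the C3 and C4 standards on the relevant scale (linear for the gradient case, logarithmic for the isocratic case), so both give an RI of about 350.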
Figure 1. Histogram of XLOGP3 values of the 91 compounds used in the study. Data were extracted from the PubChem database.

Data Analysis. Retention and mass spectral data were processed using MassLynx version 4.1 (Waters, Milford, MA, USA). The experimental survival yields of precursor ions at each collision energy were calculated using Equation 3, implemented in code written in Python version 3 (http://www.python.org).

$$\text{Survival Yield (\%)} = \frac{\text{precursor ion intensity}}{\text{(precursor ion + fragment ion) intensities}} \times 100 \tag{3}$$
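Equation 3 can be sketched in a few lines of Python. This is not the authors' code, which is not reproduced in the text; the function names and the intensity values are hypothetical. The second helper illustrates the selection, used elsewhere in the workflow, of the collision energy whose precursor survival yield is closest to 20%.

```python
def survival_yield(precursor_intensity, fragment_intensities):
    """Percent of total ion intensity retained by the precursor (Equation 3)."""
    total = precursor_intensity + sum(fragment_intensities)
    if total <= 0:
        raise ValueError("total ion intensity must be positive")
    return 100.0 * precursor_intensity / total


def energy_closest_to(yields_by_energy, target=20.0):
    """Return the collision energy whose survival yield is nearest the target."""
    return min(yields_by_energy,
               key=lambda energy: abs(yields_by_energy[energy] - target))


# Made-up intensities at three collision energies (eV), for illustration.
spectra = {
    10: (800.0, [150.0, 50.0]),
    20: (250.0, [500.0, 250.0]),
    30: (50.0, [600.0, 350.0]),
}
yields = {energy: survival_yield(p, frags)
          for energy, (p, frags) in spectra.items()}
print(yields, energy_closest_to(yields))
```

With these made-up values the survival yields are 80%, 25% and 5%, so the 20 eV spectrum would be selected.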
Candidate Structures. Candidate structures were downloaded from PubChem and HMDB by matching the experimental MIM within a relative mass error of ±15 ppm.36 MIMs were calculated by averaging scans with counts higher than or equal to 25% of the respective precursor ion peak apex in the CID spectrum acquired at the energy that resulted in a precursor survival yield closest to 20%.36 The relative mass error (15 ppm) was calculated at the 3-sigma limit when comparing actual to experimental MIMs. PubChem was queried using an in-house program written in Python version 3 using the underlying cheminformatics functions of RDKit (http://rdkit.org/)37 and the Power User Gateway (PUG) service (https://pubchem.ncbi.nlm.nih.gov/pug/). Downloaded candidate sets were saved in 91 separate structure-data (SD) files. A snapshot of the HMDB database was downloaded as an SD file (http://www.hmdb.ca/downloads) and used to acquire candidate compound structures. The resulting candidate sets were saved in 91 separate SD files.

Pre-processing of Candidates. Compounds in candidate sets that contained salts, had disconnections, contained heavy isotopes, had an overall charge (this filter was not applied to the candidate sets of the quaternary ammonium ions), contained only carbon and hydrogen, or were duplicate stereoisomers were removed prior to processing.33 Compounds with elements other than CHNOPS were also removed.

In-Silico RI Prediction and Filtering. In-silico RI prediction of the remaining candidates was carried out using topological descriptors calculated by winMolconn (version 2.1)38 and newly trained parameters. The learning process used to develop the RI model has been described previously.34 Briefly, a 4x10x10 artificial neural network ensemble model was built on RI data for 2157 compounds (1955 commercial synthetic compounds and 202 confirmed human endogenous metabolites) measured according to the protocol described above.
The model was trained according to the learning method described in the previous publication34 with RPROP40 backpropagation on a network architecture of 47 input neurons and one hidden layer of 23 neurons. The significant difference between the model used for this study and that of the previous publication is that confirmed human endogenous metabolites were deliberately excluded from the model data set in the previous publication. This was done, in part, to determine whether a model based exclusively on relatively simple synthetic compounds could be used to make reasonable predictions for more complex human metabolites. To facilitate this, 202 confirmed human metabolites were set aside as an independent validation set. For this study, the independent validation set was reintroduced to the model data set. After the additional data were added, the model was rebuilt using the same descriptors and learning method as the previous study. The metabolite data were reintroduced because the goal of the present study was to test RI model performance in the context of the MolFind algorithm, not to test whether a model based on synthetic compounds could be used to predict the RI of metabolites. The addition of the confirmed human metabolites makes use of all available data so as to achieve the maximal applicability domain.

RI predictions were made for compounds in the PubChem and HMDB candidate lists corresponding to each of the 91 unknowns. Candidates with predicted RI values that deviated from the experimental RI value of the unknown by more than a threshold value were eliminated. The threshold windows were chosen based on an algorithm that utilized the experimental RI value of the “unknown” and the similarity of each candidate to the model data. Similarity was evaluated using a partial molecular fingerprint encoded in a bit key. The filter windows can be found in Table 4. The bit key similarity approach and the algorithm for derivation of the filter windows are discussed in section SI-4. The resulting RI-filtered candidate sets were saved separately.

In-silico Predictive Fragmenters.
CFM-ID, CSI:FingerID, Mass Frontier and MetFrag were used without modifications to rank the candidate sets resulting from pre-processing and from RI filtering. MAGMa19 was not used in the study as it does not support processing compounds with an overall charge of +1 (adduct type [M+]). Relative and absolute mass errors of 15.0 ppm and 0.01 Da were used to annotate fragments against peaks in the respective experimental CID spectra with relative ion intensities closest to 20% survival yield (Equation 3). For single energy CFM-ID (version 2.0), the pre-trained model params_se_cfm, with parameter file param_output.log, was used (https://sourceforge.net/p/cfm-id/wiki/Home/). The experimental CID spectrum was repeated at all three energies (low, medium, high) so that CFM-ID assigned an average Jaccard score (default method) in ranking candidates.23 CSI:FingerID (version 3.5) was run with the instrument mode set to "qtof" and all other parameters set to defaults. Candidates with molecular formulas having tree scores lower than the cut-off