Article
Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.7b02162 • Publication Date (Web): 20 Nov 2017
Analytical Chemistry
Biologically Consistent Annotation of Metabolomics Data
Nicholas Alden1, Smitha Krishnan1, Vladimir Porokhin2, Ravali Raju3, Kyle McElearney3, Alan Gilbert3, Kyongbum Lee1*
1Tufts University, Department of Chemical and Biological Engineering, Medford, MA, 02155 USA. 2Tufts University, Department of Computer Science, Medford, MA, 02155 USA. 3Biogen Idec, Cambridge, MA, 02142, USA.
*Corresponding Author: [email protected]
Abstract
Annotation of metabolites remains a major challenge in liquid chromatography-mass spectrometry (LC-MS) based untargeted metabolomics. The current gold standard for metabolite identification is to match the detected feature with an authentic standard analyzed on the same equipment and using the same method as the experimental samples. However, there are substantial practical challenges in applying this approach to large datasets. One widely used annotation approach is to search spectral libraries in reference databases for matching metabolites; however, this approach is limited by the incomplete coverage of these libraries. An alternative, computational approach is to match the detected features to candidate chemical structures based on their mass and predicted fragmentation pattern. Unfortunately, both of these approaches can match multiple identities with a single feature. Another issue is that annotations from different tools often disagree. This paper presents a novel LC-MS data annotation method, termed Biologically Consistent Annotation (BioCAn), that combines the results from database searches and in silico fragmentation analyses, and places these results into a relevant biological context for the sample as captured by a metabolic model. We demonstrate the utility of this approach through an analysis of CHO cell samples. The performance of BioCAn is evaluated against several currently available annotation tools, and the accuracy of BioCAn annotations is verified using high purity analytical standards.
Metabolomics is an expanding field of research concerned with the comprehensive characterization of small molecule metabolites in biological systems. Metabolites are substrates and products of essential reactions that convert nutrients into energy, eliminate harmful chemicals, provide building blocks for biosynthesis, or mediate signaling and regulatory pathways. The profile of metabolites in a cell or organism thus reflects the engagement of various biochemical pathways, providing direct information about cellular phenotypes in specific environments. In recent years, metabolomics has been broadly adopted in a variety of scientific applications. A notable example is the use of metabolomics in the discovery of biomarkers for diseases1,2. Increasingly, metabolomics is also viewed as a promising approach to gain mechanistic insights into complex biological processes, and has been used to study disease progression3, effects of xenobiotics4 and developmental behavior5. Metabolomics experiments can be targeted or untargeted, depending on whether the goal is to quantify a known panel of metabolites or to globally profile the metabolites in a sample. In the untargeted case, obtaining meaningful biological information from the data hinges on reliably and efficiently resolving the chemical identities of the detected compounds, which remains a major bottleneck6. Liquid chromatography coupled to mass spectrometry (LC-MS) has become a popular option for metabolomics due to the versatility and sensitivity of LC-MS instruments7. Currently, the gold standard for confirming the identity of a detected compound (represented by a data feature) is to match the compound to a chemical standard run under exactly the same conditions as the sample. The identity is considered confirmed if the feature matches the chemical standard by two or more orthogonal metrics such as monoisotopic mass, chromatographic retention time (RT) and fragmentation (MS/MS) spectrum8.
Unfortunately, establishing a comprehensive spectral library encompassing all possible metabolites that might be present in the sample of interest is impractical for most individual laboratories due to the resources required for this effort. Further, chemical standards for many metabolites are too expensive or unavailable from commercial sources9. The most basic approach to annotating metabolomics data is to assign putative identities by matching the mass of a feature to compounds in chemical libraries such as KEGG, PubChem and ChemSpider. However, this often returns multiple indistinguishable, often incorrect, matches. The MS/MS spectrum of a metabolite expresses structural information, thus providing a useful measure for metabolite identification. Spectral databases such as NIST17, HMDB, Metlin, and MassBank provide searchable libraries of MS/MS spectra that have been experimentally generated from chemical standards run at several collision energies on different instruments10–13. These databases, along with software such as NIST’s MS Search and MS-FINDER14, provide powerful MS/MS matching tools for annotation. MS-FINDER and Sirius15 also include tools for formula prediction. While spectral databases have steadily added new compounds, the coverage of metabolites by their MS/MS libraries remains incomplete. One promising way to expand coverage is to predict the MS/MS spectra of chemicals based on their structure using in silico fragmentation analysis (or structure elucidation) tools such as Metfrag, CFM-ID, and CSI:Finger-ID16–18. Similar to database searches, the predicted spectrum of a candidate chemical is compared against the measured MS/MS spectra of detected features to determine if the chemical could be the identity of a feature. These tools, as well as MS-FINDER, attempt to determine the structure of the fragmented compound using the accurate mass and MS/MS spectra. However, as is the case for the reference databases, annotations from in silico tools
often disagree with each other. In both cases, a single tool may return multiple putative identities for a given feature. In this paper, we present a novel workflow that augments database searches and in silico fragmentation analyses by explicitly utilizing knowledge of the sample’s biological context. This workflow, which we call Biologically Consistent Annotation (BioCAn), constructs a biological context network combining information from the MS data, the outputs from five different annotation tools, and a metabolic model representing the enzymatic reactions possible in the biological system of interest. Confidence in a particular annotation for a given feature is improved by the presence and confidence of metabolites connected to the feature through substrate-product relationships in the metabolic model. Applying this workflow to untargeted LC-MS data on cultured Chinese hamster ovary (CHO) cells, we show that BioCAn improves on both reference database searches and in silico fragmentation analysis. Importantly, BioCAn can suggest putative identities for MS data features even when the features lack associated MS/MS data.
Methods Chemicals and reagents. The proprietary basal media and feed used in Chinese hamster ovary (CHO) cell cultures have been described previously19. Unless otherwise noted, all other chemicals and reagents were purchased from Sigma Aldrich (St. Louis, MO).
Cell Culture Spent medium samples were taken from fed-batch cultures of six recombinant, monoclonal antibody-producing CHO cell lines with varying growth rates. Cultures were performed in bioreactors utilizing chemically defined proprietary growth and feed media (Biogen, Cambridge, MA). The cultures were sampled at approximately the midpoint of the exponential growth phase and at close to peak cell density. The samples were immediately centrifuged under refrigeration to remove the cells. The supernatant was collected into fresh sample tubes and stored frozen at -80°C.
Sample Extraction Previously frozen medium samples were thawed on ice and added to the extraction buffer (100% methanol) at a 1 to 3 sample to methanol ratio (v/v) to precipitate the proteins in the sample. The mixture was vortexed for 15 seconds, and the protein precipitates were pelleted by centrifugation at 15,000 × g for 15 minutes at 4°C. The supernatant (200 µL) was removed and dried using a vacuum concentrator (Eppendorf Vacufuge 5301). On the day of the LC-MS experiment, the dried sample was reconstituted in 100 µL methanol/water (50/50, v/v).
Untargeted LC-MS Untargeted analysis of metabolites was performed using information-dependent acquisition (IDA) experiments performed on a quadrupole time-of-flight (TOF) mass spectrometer (TripleTOF 5600+, AB Sciex) with an electrospray ionization source. The IDA experiment consisted of a TOF MS survey scan and four dependent product ion (MS/MS) scans for the highest intensity unique masses in each TOF scan. Dynamic background subtraction was applied to trigger fragmentation when counts of a precursor ion are rising quickly over several scans to ensure that ions are selected near the top of their LC peaks and to minimize redundant MS/MS collection.
Chromatographic separation was performed using a binary pump HPLC system (1260 Infinity, Agilent). To obtain broad coverage of metabolites, each CHO cell culture sample was run three times using different combinations of ionization modes (positive or negative) and liquid chromatography (hydrophilic interaction and reverse phase) methods. A polar end-capped C18 column (Synergi HydroRP, Phenomenex) was used for separation of hydrophobic compounds. For this column, Solvent A was 0.1% formic acid in water and Solvent B was 0.1% formic acid in methanol. The column was maintained at 15°C with the gradient 0-8 minutes: 3% B, 8-38 minutes: 3→95% B, 38-45 minutes: 95% B, 45-47 minutes: 95→3% B, 47-55 minutes: 3% B. An aminopropyl column (Luna NH2, Phenomenex) was used for separation of hydrophilic and polar compounds. For this column, Solvent A was 20 mM ammonium acetate in 95:5 water:acetonitrile brought to pH 9.45 using ammonium hydroxide, and Solvent B was 100% acetonitrile. The column was maintained at 25°C with the gradient 0-15 minutes: 85→0% B, 15-28 minutes: 0% B, 28-30 minutes: 0→85% B, 30-60 minutes: 85% B.
Data Preprocessing Data preprocessing was performed using scripts written in R (R Core Team, Vienna, Austria). Raw data files were converted to mzML20 using the ProteoWizard msConvert tool21. The R package XCMS22 was used for peak detection and alignment. The CAMERA tool was used to detect masses related to isotopes, in-source fragmentation, and adducts that did not result from the addition or removal of a proton23. These peaks were excluded from the feature tables to reduce the number of falsely identified metabolite peaks. The output from these preprocessing steps was a table of unique MS features (m/z, retention time) with data from each sample represented by their response (area under the extracted ion chromatogram curve).
Annotation Tools Fragmentation (MS/MS) data were collected for all MS features using the R package mzR20, except in cases where product ions could not be generated due to poor ionization or low abundance of the precursor ion. The MS/MS data were analyzed using five different annotation tools: Metfrag16, CFM-ID17, NIST17, Metlin10 and HMDB11. Metfrag analysis was performed using an R package (https://github.com/c-ruttkies/MetFragR). CFM-ID was run using a command-line utility (https://sourceforge.net/projects/cfm-id/). Custom R and C scripts were written to automate NIST17 database annotation using the NIST MS Search program. Metlin and HMDB both provide manual online search tools that can be used to compare MS/MS spectra from a sample to chemical standards in the databases’ spectral library. The parameters used for in silico and reference database tools are found in Supporting Information, Tables S1 and S2 respectively. The feature table used for input into BioCAn can be found in the supporting information at http://pubs.acs.org.
Metabolic Model As the default option, the annotation tool described in this paper assembles a metabolic model for the organisms of interest from enzyme and gene orthologs data in KEGG (Kyoto Encyclopedia of Genes and Genomes)24. This is done using a previously described two-step automated procedure25. The only required input from the user is an organism code. First, the KEGG Orthology identifiers (K numbers) and Enzyme Commission (EC) numbers associated with the organism code (e.g., “cge” for Chinese hamster) are collected using the KEGG REST API. These K and EC numbers are linked to KEGG reaction identifiers (R numbers). In KEGG, some reactions are linked to both EC and KO numbers, some to only KO numbers, and some to only EC numbers. Next, we link the reactions with their primary substrate-product pairs as
defined by KEGG’s RCLASS data. The RCLASS data link each reaction to the corresponding KEGG compound identifiers (C numbers) of the substrates and products. All of the metabolic models in this study were constructed using these automated steps. A schematic of the modeling procedure can be found in Figure S1 of Supporting Information. An R script implementation is included in BioCAn and available upon request. The rationale for using an automated procedure to construct the model is that this supports a user-friendly and generally applicable workflow for samples from many different organisms. In the case of CHO cells, there are a number of published models available, ranging from kinetic models of central carbon metabolism26 to genome-scale models27,28. However, this may not be the case for other, less widely investigated organisms, and thus we did not assume that a published model would always be available. Nevertheless, using a published or manually curated model is also an option. Currently, the only requirement is that the reactions in the model define substrate-product pairs in terms of C numbers or some other widely used compound identifiers that can be converted into C numbers.
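The linking steps described above (organism genes to K numbers, K numbers to R numbers, R numbers to substrate-product pairs) can be sketched in Python. This is a minimal sketch and not the authors' R implementation. It assumes the two-column, tab-separated text format returned by the KEGG REST `link` operation, covers only the K-number route (the EC-number route would be analogous), and uses made-up identifiers. All function names are hypothetical.

```python
from collections import defaultdict

def parse_link(text):
    """Parse 'source<TAB>target' lines into a dict mapping source -> set of targets."""
    mapping = defaultdict(set)
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        source, target = line.split("\t")
        mapping[source].add(target)
    return mapping

def organism_reactions(gene_to_ko, ko_to_reaction):
    """Collect the reactions reachable from an organism's genes via K numbers."""
    reactions = set()
    for kos in gene_to_ko.values():
        for ko in kos:
            reactions |= ko_to_reaction.get(ko, set())
    return reactions

def substrate_product_pairs(reactions, reaction_to_pairs):
    """Return the (substrate, product) C-number pairs of the given reactions."""
    pairs = set()
    for reaction in reactions:
        pairs |= reaction_to_pairs.get(reaction, set())
    return pairs

# Made-up example data in the assumed format (not real KEGG records).
gene_ko_text = "cge:0001\tko:K00001\ncge:0002\tko:K00002\n"
ko_rn_text = "ko:K00001\trn:R00001\nko:K00002\trn:R00002\nko:K00003\trn:R00003\n"
rn_pairs = {
    "rn:R00001": {("C00001", "C00002")},
    "rn:R00002": {("C00002", "C00003")},
    "rn:R00003": {("C00004", "C00005")},  # K00003 is absent from the organism
}

reactions = organism_reactions(parse_link(gene_ko_text), parse_link(ko_rn_text))
model_pairs = substrate_product_pairs(reactions, rn_pairs)
print(sorted(reactions))
print(sorted(model_pairs))
```

In this toy example the reaction linked only to K00003 is excluded, because no gene of the organism maps to that K number.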
BioCAn Network Construction The metabolic model was represented as a graph using the R packages igraph (http://igraph.org) and visNetwork (https://CRAN.R-project.org/package=visNetwork). In this graph, the edges and nodes represent, respectively, the reactions and compounds. A pair of compound nodes was connected by an edge if there was a reaction that involved the pair as a reactant and product. This network graph was then pruned by mapping the feature table onto the compound nodes. Only masses corresponding to ions formed through protonation or deprotonation were considered; masses that were flagged by the CAMERA tool as isotopes, adducts (e.g., Na+, K+, NH4+) or in-source fragments were not mapped to compounds in this network. A compound was retained if 1) its mass was in the feature table and it was connected by a reaction edge to at least one other compound, or 2) it was connected to at least two other compounds whose masses were in the feature table. The latter criterion accounts for cases where a potentially unstable or undetectable (e.g., due to poor ionization) compound was not in the feature table, but the compound’s immediate substrate and product were present. To aid in mapping compounds to the network, annotations from different tools were first matched to each other using KEGG IDs. If KEGG IDs were not provided by the tool, alternative identifiers such as CAS number and InChI Key were matched to a KEGG ID using information in the KEGG database and the Metabolomics Workbench REST API (http://www.metabolomicsworkbench.org/).
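The two retention criteria can be illustrated with a short Python sketch. It is not the authors' R code. It assumes that mass matching has already been done, so that `detected` is simply the set of compounds whose mass appears in the feature table, and it evaluates both criteria on the unpruned graph. The function name and the single-letter compound labels are hypothetical.

```python
def prune_network(edges, detected):
    """Apply the two retention criteria to an undirected compound graph.

    edges: iterable of (compound_a, compound_b) reaction edges
    detected: set of compounds whose mass matches a feature-table entry
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    retained = set()
    for compound, adjacent in neighbors.items():
        others = adjacent - {compound}  # ignore self-loops
        if compound in detected and others:
            # Criterion 1: detected mass and at least one reaction neighbor.
            retained.add(compound)
        elif len(others & detected) >= 2:
            # Criterion 2: undetected, but at least two detected neighbors.
            retained.add(compound)
    return retained

# Toy chain A-B-C-D plus a separate edge E-F; A, C and F are detected.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
toy_detected = {"A", "C", "F"}
kept = prune_network(toy_edges, toy_detected)
print(sorted(kept))
```

Here B is kept by the second criterion because both of its neighbors (A and C) were detected, whereas D and E each have only one detected neighbor and are dropped.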
BioCAn Scoring Each compound node in the pruned BioCAn network was first assigned an Individual node Score (IS), and then a BioCAn score (Annotation Score, AS), by aggregating the results from the five annotation tools. Tables S1 and S2 in Supporting Information summarize the search parameters used for each tool, including the criteria for accepting or rejecting a match between a feature and a chemical identity suggested for this feature. For the reference databases, the acceptance criterion value was set based on the scores calculated by the search tools for the chemical standards (Figure S2). If a reference database (NIST, Metlin or HMDB) search found a match between a compound and detected feature, then this added a value of 1 to the corresponding IS. If an in silico fragmentation tool (Metfrag or CFM-ID, top rank) found a match, then this added a value of 0.25 to the IS. If the exact mass of a compound matched an accurate mass in the feature table, then this added a value of 0.125 to the corresponding IS. Each IS was then scaled to a value between 0 and 1 by dividing the IS by the maximal possible value of 3.625 (three reference database matches, two in silico matches and one mass match: 3 × 1 + 2 × 0.25 + 0.125).
Finally, the AS for a given compound node was calculated by summing the node’s IS with the IS values of all other nodes within two reactions of the given compound.
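The IS and AS calculations can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the BioCAn implementation: the graph is given as an adjacency dictionary, "within two reactions" is read as a shortest-path distance of at most two edges, and the function names and the toy network are hypothetical.

```python
from collections import deque

DATABASE_TOOLS = ("NIST", "Metlin", "HMDB")   # 1 point per match
IN_SILICO_TOOLS = ("Metfrag", "CFM-ID")       # 0.25 points per top-ranked match
MASS_MATCH_POINTS = 0.125
MAX_IS = 3 * 1 + 2 * 0.25 + MASS_MATCH_POINTS  # 3.625

def individual_score(tool_hits, mass_match):
    """Scaled individual node score (IS) in the range 0 to 1."""
    raw = sum(1.0 for tool in DATABASE_TOOLS if tool in tool_hits)
    raw += sum(0.25 for tool in IN_SILICO_TOOLS if tool in tool_hits)
    raw += MASS_MATCH_POINTS if mass_match else 0.0
    return raw / MAX_IS

def annotation_score(node, neighbors, scores, max_steps=2):
    """Sum the IS of `node` and of every node within `max_steps` reactions."""
    seen = {node}
    queue = deque([(node, 0)])
    total = 0.0
    while queue:
        current, depth = queue.popleft()
        total += scores.get(current, 0.0)
        if depth < max_steps:
            for nxt in neighbors.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return total

# Toy chain A-B-C-D (hypothetical compounds and tool results).
toy_neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
toy_scores = {
    "A": individual_score({"NIST", "Metlin", "HMDB", "Metfrag", "CFM-ID"}, True),
    "B": individual_score(set(), True),
    "C": individual_score({"Metfrag"}, True),
    "D": individual_score({"NIST"}, True),
}
as_a = annotation_score("A", toy_neighbors, toy_scores)
print(round(as_a, 3))
```

In the toy chain, node D lies three reactions away from A and therefore does not contribute to the AS of A.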
Experimental Validation of Annotations High purity standards were run using the same LC-MS conditions as the experimental samples. An annotation was considered confirmed by MS/MS if the dot product and reverse dot product scores for the annotation were both greater than 700. The scores were calculated between the MS/MS spectra of the sample and the chemical standard using NIST’s MS Search. If a feature lacked high-quality MS/MS data, an RT match was considered. An annotation was considered confirmed by RT if the RTs of the sample and standard fell within a 1 min window. The RT window was set to account for batch-to-batch variations between samples and standards run at different times.
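The confirmation rule can be written as a small decision function. The sketch below is one reading of the text, with two labeled assumptions: RT is consulted only when MS/MS scores are unavailable, and "within a 1 min window" is interpreted as an absolute RT difference of at most 1 min. The function name and the example values are hypothetical.

```python
def confirmation_status(dot=None, reverse_dot=None, rt_sample=None,
                        rt_standard=None, score_cutoff=700, rt_window_min=1.0):
    """Return 'MS/MS', 'RT' or 'unconfirmed' for one annotation."""
    has_msms = dot is not None and reverse_dot is not None
    if has_msms:
        # Both NIST-style scores must exceed the cutoff.
        if dot > score_cutoff and reverse_dot > score_cutoff:
            return "MS/MS"
        return "unconfirmed"
    # Assumption: RT is used only when MS/MS data are missing.
    if rt_sample is not None and rt_standard is not None:
        if abs(rt_sample - rt_standard) <= rt_window_min:
            return "RT"
    return "unconfirmed"

print(confirmation_status(dot=850, reverse_dot=720))
print(confirmation_status(dot=850, reverse_dot=650))
print(confirmation_status(rt_sample=5.2, rt_standard=5.9))
print(confirmation_status(rt_sample=5.2, rt_standard=6.5))
```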
Results Annotation by Reference Databases and In Silico Tools We first compared the annotation performance of NIST, Metlin, HMDB, Metfrag, and CFM-ID by analyzing a set of 21 chemical standards representing common metabolites (amino acid derivatives, sugars and vitamins, Table S3). Each standard was run in both ESI positive and negative modes, and the resulting MS/MS spectra were analyzed using each of the five annotation tools as if they were unknown samples (Figure 1A). An annotation tool scored a ‘Match’ if the tool identified only the correct compound. The ‘Match’ rate ranged from 15 to 40% for the different tools. Annotating the MS/MS spectra based on the consensus of two or more tools improved the ‘Match’ rate to just over 50%. A tool scored a ‘Mismatch’ if the correct compound was not among the list of compounds identified by the tool. This occurred at a rate of less than 10%. A third category of annotations was ‘Ambiguous,’ where a tool identified multiple matches, including the correct compound. Nearly 40% of the annotations obtained using the consensus criterion fell into this category. There was not a single case where all five tools agreed on a single annotation. Next, we applied the five annotation tools to untargeted metabolomics data collected on 12 spent medium samples taken from fed-batch cultures of six CHO cell lines with varying growth rates. Extracted samples were run through three different combinations of LC and MS methods (Table S4). After processing the raw data and applying quality control steps, the number of detected features ranged from approximately 2,400 to 4,800 depending on the LC-MS method. Together, the five annotation tools assigned putative identities to 20% of the features. However, less than 1% of annotated features were assigned the same identity by all five methods (Figure 1B).
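The three annotation categories used for the chemical standards can be expressed as a simple classification rule. The sketch below is illustrative only. Treating an empty candidate list as a separate "No annotation" outcome is an assumption, since the text does not say how such cases were counted. The compound identifiers are the KEGG C numbers for L-alanine, beta-alanine and sarcosine that appear in Figure 2.

```python
def categorize(candidates, correct):
    """Classify a tool's candidate list for a standard of known identity."""
    candidates = set(candidates)
    if not candidates:
        return "No annotation"  # assumption: not defined in the text
    if correct not in candidates:
        return "Mismatch"
    return "Match" if len(candidates) == 1 else "Ambiguous"

print(categorize(["C00041"], correct="C00041"))
print(categorize(["C00041", "C00099"], correct="C00041"))
print(categorize(["C00213"], correct="C00041"))
```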
To compare the annotations from different tools, we defined a ‘consensus set’ of features that were assigned the same chemical identity by at least two different tools. The annotations from each tool were then compared against this consensus set (Figure 1C). For annotations from NIST, Metlin, HMDB, and CFM-ID, the level of agreement with the consensus set ranged from 70% to 80%. The level of agreement for Metfrag was substantially less, around 50%. This is likely due to the scoring method utilized by Metfrag. In cases where a detected mass matches only one compound, Metfrag assigned this compound the top score for that mass if any of the predicted fragments matches at least one fragment detected in the sample. Taken together, these results suggest that aggregating the outputs from several different annotation tools can help build greater confidence in the annotations. However, this consensus
approach does not resolve the issue of ambiguous annotations where multiple chemical identities are proposed for the same MS data feature.
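One way to build the consensus set and compute per-tool agreement is sketched below. The text does not give the exact formula for percent agreement, so this sketch assumes it is the fraction of a tool's (feature, compound) annotations that also appear in the consensus set. The feature labels and tool outputs are made up for illustration.

```python
from collections import Counter

def consensus_set(annotations, min_tools=2):
    """Return (feature, compound) pairs proposed by at least `min_tools` tools.

    annotations: dict mapping tool name -> iterable of (feature, compound) pairs
    """
    counts = Counter()
    for pairs in annotations.values():
        counts.update(set(pairs))  # each tool counts at most once per pair
    return {pair for pair, n in counts.items() if n >= min_tools}

def percent_agreement(tool_pairs, consensus):
    """Assumed definition: share of a tool's annotations found in the consensus set."""
    tool_pairs = set(tool_pairs)
    if not tool_pairs:
        return 0.0
    return 100.0 * len(tool_pairs & consensus) / len(tool_pairs)

# Made-up annotations for two features.
toy_annotations = {
    "NIST": {("f1", "C00041"), ("f2", "C00099")},
    "Metlin": {("f1", "C00041")},
    "Metfrag": {("f1", "C00213"), ("f2", "C00099")},
}
toy_consensus = consensus_set(toy_annotations)
agreement = {tool: percent_agreement(pairs, toy_consensus)
             for tool, pairs in toy_annotations.items()}
print(agreement)
```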
Figure 1: A: A set of 21 chemical standards were analyzed using the same protocol and data processing method as the CHO cell culture samples. The MS/MS data from these standards were then annotated using 5 different tools. See text for definitions of annotation categories. B: Agreement between the tools on putative identities for features detected in CHO cell culture samples. C: A consensus set was built containing all compounds whose presence in the sample was indicated by two or more annotation tools. The annotations from individual tools were compared to this consensus set to calculate the percent agreement.
Many of the annotations suggested by individual tools (or even the consensus of two or more tools) could not be explained biologically. The spent medium samples were collected from CHO cell monocultures grown in chemically defined media. Therefore, nearly all of the compounds present in the spent medium samples should be products of CHO cell metabolism. However, only 14% (353 out of 2,621) of the compounds that could be matched to KEGG IDs identified by at least one of the five annotation tools could be mapped to a genome-scale metabolic model of the Chinese hamster (Figure S3). Interestingly, when all five tools agreed on an annotation for a given feature, all but one of these putatively identified metabolites were present in the CHO cell model, suggesting that using the consensus criterion could reduce the false identification rate. Based on these observations, we next
investigated whether annotation of the detected features could be further improved by characterizing the biochemical context of each feature.
Network Construction and Local Neighborhood Analysis To define the biochemical context for the detected MS data features, we constructed a reaction network linking the putatively identified compounds based on relationships between reactants and products as defined in the KEGG RCLASS database. This network comprised 452 compound nodes corresponding to 463 unique features connected by 676 reaction edges. The complete network mapping all of the masses detected in the twelve CHO cell culture samples is shown in Figure 2A. Each of the 452 compound nodes in the network was further analyzed in the context of the node’s immediate reaction neighborhood to determine the likelihood the compound was indeed detected by the LC-MS experiment. A node’s reaction neighborhood comprised all compounds within two reactions of the node (Figure 2B-D). Each compound node was assigned an individual node score (IS) and overall annotation score (AS) as described in Methods.
Figure 2: A: BioCAn network built from the CHO cell model and untargeted LC-MS data. Colors indicate confidence levels in the annotations, ranging from low (black, no matching mass in the feature table) to high (green, putative identity assigned by two or more reference databases and in silico tools). B-D: Local reaction neighborhoods of three compounds with the same monoisotopic mass: C00041 (L-alanine), C00099 (beta-alanine) and C00213 (sarcosine), respectively.
The putative identity for each feature mapped to the network is the associated compound with the highest AS. Figure 2B-D show three neighborhoods associated with a single feature with a monoisotopic mass of 89.05. These neighborhoods are centered on the compounds C00041 (Figure 2B), C00099 (Figure 2C) and C00213 (Figure 2D), with AS of 7.5, 3.4 and 2.5 respectively. Based on these scores, the BioCAn algorithm identified this feature as C00041, or L-alanine. Each of the 463 CHO cell sample features mapped to the network was assigned a putative identity using this approach. Due to overlap in the masses detected by the different LC-MS methods, these 463 features were mapped to a smaller number (264) of distinct compounds.
Experimental Verification of Annotations We tested the accuracy of BioCAn results by analyzing standards that correspond to a subset of putative identities determined by BioCAn. In total, 50 chemical standards (Table S5), representing 19% of annotated compounds, were selected for confirmation of BioCAn annotations based on price and availability of these chemicals from vendors. These standards were analyzed using the same experimental protocol and data processing method as the CHO cell culture medium samples. Of particular interest were four compounds identified only by BioCAn. Table 1 compares the annotation performance of each tool against BioCAn. When multiple matches were suggested by an annotation tool, only the highest scoring match was considered as the tool’s annotation. In the case of a tie, all matches with the tied score were considered annotations. This prevented a tool from achieving a high rate of correct annotations by simply identifying a large number of matches. The in silico tools generated many more false positive matches from tied scores than the database tools or BioCAn, which resulted in lower precision and higher false discovery rate (FDR). The database tools outperformed the in silico tools in terms of specificity (true negative rate), precision, and FDR. However, two of the three database tools performed slightly worse in terms of sensitivity (true positive rate). A likely explanation for the lower sensitivity is that the limited coverage of database tools results in fewer assigned matches. BioCAn, which aggregates matches from databases and in silico tools, correctly identified 37 of the 50 standards (33 by MS/MS and 4 by RT alone), more than any other tool, while generating the second fewest false positive matches. This resulted in the highest sensitivity and precision (Figure 3).
Table 1: Classification performance by annotation tool.
Abbreviations: TP – True Positive, FP – False Positive, TN – True Negative, FN – False Negative, FPR – False Positive Rate, Sens – sensitivity, Spec – specificity, Prec – precision, FDR – False Discovery Rate
Tool      TP   FP   TN    FN   FPR    Sens   Spec   Prec   FDR
Metlin    25   20   337   25   0.06   0.50   0.94   0.56   0.44
HMDB      10   10   347   40   0.03   0.20   0.97   0.50   0.50
NIST      15   13   344   35   0.04   0.30   0.96   0.54   0.46
Metfrag   22   70   287   28   0.20   0.44   0.80   0.24   0.76
CFM-ID    22   58   299   28   0.16   0.44   0.84   0.28   0.73
BioCAn    37   13   344   13   0.04   0.74   0.96   0.74   0.26
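The rate columns of Table 1 follow from the four counts through the standard definitions of these statistics. As a check, the short Python sketch below recomputes them for the BioCAn row. The function name is hypothetical and the formulas are the conventional ones, not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification rates from confusion-matrix counts."""
    return {
        "FPR": fp / (fp + tn),    # false positive rate
        "Sens": tp / (tp + fn),   # sensitivity (true positive rate)
        "Spec": tn / (fp + tn),   # specificity (true negative rate)
        "Prec": tp / (tp + fp),   # precision
        "FDR": fp / (tp + fp),    # false discovery rate = 1 - precision
    }

# Counts from the BioCAn row of Table 1.
biocan = classification_metrics(tp=37, fp=13, tn=344, fn=13)
biocan_rounded = {name: round(value, 2) for name, value in biocan.items()}
print(biocan_rounded)
```

Rounded to two decimals, the computed values agree with the BioCAn row of the table.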
Figure 3: Statistical classification performance for different annotation tools. Data shown for Reference DBs and in silico tools represent averages for each type of annotation tool. Error bars represent one standard deviation.
Compared to the other tools, BioCAn also achieved the lowest FDR at 26%. This is still a relatively high rate, and confirmation of annotation results using authentic standards is clearly necessary to achieve Level 1 or 2 identification according to the Metabolomics Standards Initiative guideline8.
Importance of Metabolic Model We next sought to assess the sensitivity of BioCAn performance with respect to the choice of the underlying metabolic model. To this end, we repeated the above analysis with metabolic models of several well-characterized organisms selected from different Kingdoms of life. Metabolic models were constructed for each species using the same approach as the Chinese hamster (C. griseus) model. The BioCAn workflow was then applied using each of the six different models, and the annotations compared to the results obtained using the Chinese hamster model. In general, models that were similar to the Chinese hamster model (as determined by a Jaccard score for overlapping reactions) produced similar annotations with fewer differences in true and false positives (Figure 4A).
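The Jaccard score used to compare models is the size of the intersection of two reaction sets divided by the size of their union. A minimal sketch follows, with made-up reaction identifiers. Returning 0.0 when both sets are empty is a convention chosen here, not something stated in the text.

```python
def jaccard(reactions_a, reactions_b):
    """Jaccard similarity of two reaction sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(reactions_a), set(reactions_b)
    union = a | b
    if not union:
        return 0.0  # convention for two empty models (assumption)
    return len(a & b) / len(union)

# Hypothetical reaction sets for two models.
model_1 = {"R1", "R2", "R3", "R4"}
model_2 = {"R2", "R3", "R4", "R5", "R6"}
similarity = jaccard(model_1, model_2)
print(similarity)
```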
Figure 4: A: Comparison of BioCAn annotations for the CHO cell data obtained with different metabolic models. Each model was assembled from KEGG data using an automated procedure as described in Methods. B: Effects of model incompleteness on accuracy of annotations. Data shown are averages of 10 replicate runs with different sets of randomly selected reactions removed for each run. Trend lines added for visual emphasis. Y-axis label refers to number of TP (blue) or FP (orange).
Not surprisingly, the human model shared more true positives with C. griseus than the microorganism and plant models. The other animal models fell in between. This shows that the choice of metabolic model has the potential to influence the results of annotation. However, the trends also suggest that using the metabolic model of a closely related organism can be a reasonable substitute in cases where a metabolic model for the organism of interest is unavailable.

To evaluate the impact of model completeness on annotation, BioCAn was rerun on the same dataset with a subset of reactions randomly removed from the metabolic model. To account for variability in the annotation results due to the random selection of reactions that were removed, the analysis was repeated 10 times. Even when 40% of the reactions were removed, the number of true positives decreased by only 10%. These observations underscore the importance of constructing the BioCAn network from a complete model, but suggest that the accuracy of annotations is relatively robust even when the model is incomplete. In this regard, BioCAn could be used to suggest putative identities for many detected features even if the metabolic reactions expressed in the underlying biological system have been only partially characterized and cataloged.
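The random-removal experiment described above can be sketched as follows. This is an illustration of the resampling design only: the toy model, the set of "validated" compounds, and the `covered_compounds` function (a crude stand-in for the full annotation step) are all assumptions made for the example and do not reproduce the BioCAn workflow.

```python
import random
import statistics


def drop_reactions(reactions, fraction, rng):
    """Return the reactions that remain after removing `fraction` at random."""
    n_keep = round(len(reactions) * (1 - fraction))
    return rng.sample(reactions, n_keep)


def covered_compounds(reactions):
    """Toy stand-in for annotation: compounds that still appear in the model."""
    return {c for substrate, product in reactions for c in (substrate, product)}


def replicate_runs(model, truth, fraction, n_runs=10, seed=0):
    """Repeat the removal n_runs times; return mean and SD of recovered compounds."""
    rng = random.Random(seed)
    hits = []
    for _ in range(n_runs):
        reduced = drop_reactions(model, fraction, rng)
        hits.append(len(truth & covered_compounds(reduced)))
    return statistics.mean(hits), statistics.stdev(hits)


# Hypothetical model: each reaction is a (substrate, product) pair.
model = [(f"C{i}", f"C{i + 1}") for i in range(50)]
model += [(f"C{i}", f"C{i + 2}") for i in range(50)]
# Hypothetical set of validated compounds.
truth = {f"C{i}" for i in range(0, 52, 2)}

for fraction in (0.0, 0.2, 0.4):
    mean_hits, sd_hits = replicate_runs(model, truth, fraction)
    print(f"removed {fraction:.0%}: {mean_hits:.1f} +/- {sd_hits:.1f} recovered")
```

Fixing the seed makes the replicate runs reproducible, while the spread across runs indicates how much the result depends on which reactions happened to be removed.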
Discussion
In this paper, we describe a novel method for annotating metabolomics data from untargeted LC-MS experiments. This method combines the search results from reference databases and in silico analyses
with available information about the metabolic reactions expressed in the biological system of interest to place each detected MS feature into a relevant biochemical context. This context then becomes part of the evidence for assigning chemical identities to the feature.

The current gold standard for metabolite identification is to match two or more physicochemical properties of a detected feature against those of an authentic chemical standard analyzed on the same instrument using the same analytical method8. Applying this standard for annotation of untargeted LC-MS data requires a large, comprehensive in-house library of several thousand metabolites generated by analyzing the standards on the same instrument, ideally using several collision energies, chromatographic methods and ionization modes. In a recent study, researchers at Metabolon Inc. confirmed the identities of 435 metabolites detected in cerebrospinal fluid using untargeted MS experiments29. This work required four separate LC-MS methods and an in-house spectral library containing over 4,000 compounds. Several other recent studies that relied on Metabolon’s services for metabolomics reported similar numbers of identified compounds30,31. Construction of a large spectral library is not a trivial task, and coverage is still limited by the compounds available from chemical vendors9,31. The latter limitation could be especially problematic for discovery-oriented studies, as vendors do not carry the full diversity of metabolites expected in samples from different biological systems. In comparison, the method described in this paper annotated 264 compounds, with an estimated sensitivity (true positive rate) of 74% and specificity (true negative rate) of 96%.

A practical alternative to building a custom library is to utilize spectral libraries in public reference databases. Over the last several years, public libraries have steadily increased the number of compounds with MS/MS spectra. A number of powerful annotation tools are now available that can utilize MS/MS data for metabolite identification. In this paper, we show that combining the annotations from multiple tools and identifying a consensus can help improve confidence in the annotation accuracy. However, this improvement can be relatively modest, and the discrepancy between results from different tools demands additional strategies for reconciling the differences (Figure 1). Importantly, the aforementioned annotation tools offer limited utility in assigning putative identities to MS data features that do not have high quality MS/MS data, potentially excluding a large fraction (35% in the case of the CHO cell samples of this study) of detected features from further analysis.

One way to upgrade the information content in untargeted LC-MS data for annotation purposes is to incorporate knowledge of the biological context of the sample. For example, the mummichog algorithm constructs activity networks from a metabolic model of the relevant biological system and experimental data32. These networks are used to determine significantly active metabolic modules and pathways based on the detected masses, building confidence in the metabolites within these networks. In a study examining the metabolome of human monocyte-derived dendritic cells32, mummichog was able to putatively identify 77 metabolites out of 7,995 detected features. On a conceptual level, the BioCAn workflow is similar to mummichog in that both methods construct a biological context network based on a metabolic model and LC-MS data. A key difference, however, is that BioCAn takes additional steps to score individual annotations based on the connectivity of the putatively identified compound to other compounds in the network. A second key advance in BioCAn is that it integrates annotations from multiple sources (NIST, Metlin, HMDB, Metfrag and CFM-ID) into the context network. In this way, each putatively identified compound becomes part of the evidence for determining the likelihood that an annotated compound was indeed observed by the LC-MS experiment.
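The idea of letting neighboring compounds contribute evidence can be illustrated with a toy calculation. The scoring rule below is a deliberately simplified assumption made for this example and is not the BioCAn scoring function. The compound names, the `weight` parameter, and the evidence counts (number of annotation tools supporting each compound) are likewise hypothetical.

```python
def context_score(candidate, evidence, neighbors, weight=0.5):
    """Toy score: own evidence plus a weighted mean of the neighbors' evidence.

    evidence  -- dict mapping compound -> number of tools supporting it
    neighbors -- dict mapping compound -> compounds one reaction away
    """
    own = evidence.get(candidate, 0)
    nbrs = neighbors.get(candidate, [])
    if not nbrs:
        return float(own)
    support = sum(evidence.get(n, 0) for n in nbrs) / len(nbrs)
    return own + weight * support


# Two hypothetical candidates, X and Y, proposed for the same detected mass.
evidence = {"X": 1, "Y": 1,
            "x1": 0, "x2": 0, "x3": 1,   # neighbors of X: mostly unsupported
            "y1": 2, "y2": 1, "y3": 3}   # neighbors of Y: well supported
neighbors = {"X": ["x1", "x2", "x3"], "Y": ["y1", "y2", "y3"]}

score_x = context_score("X", evidence, neighbors)
score_y = context_score("Y", evidence, neighbors)
print(f"X: {score_x:.2f}, Y: {score_y:.2f}")
```

Although X and Y have identical direct evidence, Y scores higher because its metabolic neighbors are themselves supported by annotation tools, which is the intuition behind using a context network to rank candidates.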
Figure S4, showing the identification of pyridoxine, illustrates the advantage of integrating multiple annotation scores into a biological context network. Noradrenaline and pyridoxine have the same exact mass, and both were identified by at least one annotation tool. The noradrenaline neighborhood comprises six other compounds, with all but one node representing only mass matches. The pyridoxine neighborhood comprises five other compounds, with every node representing one or more reference database matches. BioCAn assigned a score of 3.7 and 0.7 to pyridoxine and noradrenaline, respectively, reflecting the additional evidence for pyridoxine’s metabolic substrates and products provided by the other annotation tools.

Two key factors that influence the performance of BioCAn are the coverage of the experimental data and accuracy of the underlying metabolic model. Consistent with prior studies31, we found that using multiple LC-MS methods substantially increases coverage of the metabolome. For the CHO cell data analyzed in this study, only 10% of the masses in the feature table were detected by all three methods. When the same LC method was paired with two different ionization modes, only 18% of detected masses overlapped. The breadth of coverage directly affects the completeness of the biological context network (Figure S5), which in turn determines the evidence available for BioCAn. This observation was further corroborated by an analysis of a sparse data set from a HepG2 cell culture experiment described in the Supporting Information (Figure S6). On the other hand, a reduction in the number of annotations due to the use of a single LC-MS method did not affect the conclusions drawn (Tables S6 and S7, Figures S7 and S8). One potential way of expanding metabolome coverage is to use Data-Independent Acquisition (DIA) mass spectrometry. Unlike IDA, DIA experiments attempt to measure fragment masses for all detected precursor ions, which should in principle increase the number of features with MS/MS data. Linking MS/MS spectra to their respective precursor ions remains challenging, but recent advances in deconvolution algorithms in tools such as MS-DIAL33 suggest an exciting possibility to utilize DIA experiments in LC-MS based metabolomics.

The second key input from the user is a metabolic model describing the biological system of interest. In this study, we used an automatically generated model for the CHO cell built from KEGG data. The automation provides out-of-the-box convenience for users. However, the workflow can also accept manually constructed models. Indeed, we expect the utility of BioCAn to improve further as the catalog of high-quality genome-scale models continues to expand. At present, the only requirement is that the user-supplied model’s format is compatible with KEGG’s system of reaction and compound identifiers. Prospectively, this requirement could be relaxed by including a formatting tool that converts the chemical identifiers of model compounds and annotation tool results into a common format such as InChI or SMILES.

One potential consequence of automating model generation is that this could lead to missing or erroneous reactions. We found that annotations obtained with an incomplete model largely agree with those made using a complete model, suggesting that model completeness has a relatively small effect on the accuracy of annotations (Figure 4). Instead, an incomplete model more severely influences the number of metabolites that can be identified. Even if the metabolic model were complete, the reliance on a model representing known reactions still limits the ability to discover previously uncharacterized metabolites. One approach to address this limitation is to augment the metabolic model with compounds that could result from substrate
flexibility, for example using computational prediction tools such as MINEs34, PROXIMAL35, and MetaPrint2D36. For complex systems harboring multiple, incompletely characterized species, metabolic functions could be inferred from related species using metagenomic analysis tools such as PICRUSt37 or Tax4Fun38. In the present study, we found that using metabolic models of closely related species yields predictions that are about as accurate as those obtained with the species-specific model.
Conclusions
In this paper, we show that BioCAn can outperform currently available annotation tools when applied to untargeted LC-MS data on samples from cultures of a single cell type (specifically, CHO cells). Further work is warranted to develop modeling tools that can expand the basic set of enzymatic reactions that are already known for a given biological system. For example, this could be accomplished by learning the patterns of chemical transformations encoded in the system’s enzymes. Further work is also warranted to refine the biological context network based on experimental evidence for the reactions included in the underlying metabolic model. This refinement would be particularly useful for studies on systems composed of many species where the enzymatic reactions possible in the system are ill-defined and evidence for the reactions needs to be collected through metagenomic approaches.

Author Contributions
NA developed the BioCAn workflow and performed the data analysis of CHO samples. CHO samples used in the study were generated by KM, AG and RR from Biogen Idec. SK generated and analyzed HepG2 data. VP aided in the development of a C script for the automation of the NIST software. This work was supported by Biogen Idec (Cambridge, MA) and a grant from the NIH (award # CA211839-01).

Conflict of Interest
The authors declare no financial or commercial conflict of interest.

Supporting Information
Seven additional tables and eight figures referenced in the text. Additional evaluation of BioCAn performance on a sparse dataset. A zip file containing the CHO and HepG2 feature tables. This material is available free of charge via the Internet at http://pubs.acs.org.
References
(1) Menni, C.; Zierer, J.; Valdes, A. M.; Spector, T. D. Nat. Rev. Rheumatol. 2017, 13, 174–181.
(2) Bhargava, P.; Calabresi, P. A. Mult. Scler. 2016, 22, 451–460.
(3) Lim, C. K.; Bilgin, A.; Lovejoy, D. B.; Tan, V.; Bustamante, S.; Taylor, B. V.; Bessede, A.; Brew, B. J.; Guillemin, G. J. Sci. Rep. 2017, 7, 41473.
(4) Manteiga, S.; Lee, K. Environ. Health Perspect. 2017, 125, 615–622.
(5) Li, Y.; Wang, X.; Hou, Y.; Zhou, X.; Chen, Q.; Guo, C.; Xia, Q.; Zhang, Y.; Zhao, P. J. Proteome Res. 2016, 15, 193–204.
(6) Benton, H. P.; Ivanisevic, J.; Mahieu, N. G.; Kurczy, M. E.; Johnson, C. H.; Franco, L.; Rinehart, D.; Valentine, E.; Gowda, H.; Ubhi, B. K.; Tautenhahn, R.; Gieschen, A.; Fields, M. W.; Patti, G. J.; Siuzdak, G. Anal. Chem. 2015, 87, 884–891.
(7) Dettmer, K.; Aronov, P. A.; Hammock, B. D. Mass Spectrom. Rev. 2007, 26, 51–78.
(8) Sumner, L. W.; Amberg, A.; Barrett, D.; Beale, M. H.; Beger, R.; Daykin, C. A.; Fan, T. W.-M.; Fiehn, O.; Goodacre, R.; Griffin, J. L.; Hankemeier, T.; Hardy, N.; Harnly, J.; Higashi, R.; Kopka, J.; Lane, A. N.; Lindon, J. C.; Marriott, P.; Nicholls, A. W.; Reily, M. D.; Thaden, J. J.; Viant, M. R. Metabolomics 2007, 3, 211–221.
(9) Broeckling, C. D.; Ganna, A.; Layer, M.; Brown, K.; Sutton, B.; Ingelsson, E.; Peers, G.; Prenni, J. E. Anal. Chem. 2016, 88, 9226–9234.
(10) Tautenhahn, R.; Cho, K.; Uritboonthai, W.; Zhu, Z.; Patti, G. J.; Siuzdak, G. Nat. Biotechnol. 2012, 30, 826–828.
(11) Wishart, D. S.; Jewison, T.; Guo, A. C.; Wilson, M.; Knox, C.; Liu, Y.; Djoumbou, Y.; Mandal, R.; Aziat, F.; Dong, E.; Bouatra, S.; Sinelnikov, I.; Arndt, D.; Xia, J.; Liu, P.; Yallou, F.; Bjorndahl, T.; Perez-Pineiro, R.; Eisner, R.; Allen, F.; Neveu, V.; Greiner, R.; Scalbert, A. Nucleic Acids Res. 2013, 41, 801–807.
(12) Horai, H.; Arita, M.; Kanaya, S.; Nihei, Y.; Ikeda, T.; Suwa, K.; Ojima, Y.; Tanaka, K.; Tanaka, S.; Aoshima, K.; Oda, Y.; Kakazu, Y.; Kusano, M.; Tohge, T.; Matsuda, F.; Sawada, Y.; Hirai, M. Y.; Nakanishi, H.; Ikeda, K.; Akimoto, N.; Maoka, T.; Takahashi, H.; Ara, T.; Sakurai, N.; Suzuki, H.; Shibata, D.; Neumann, S.; Iida, T.; Tanaka, K.; Funatsu, K.; Matsuura, F.; Soga, T.; Taguchi, R.; Saito, K.; Nishioka, T. J. Mass Spectrom. 2010, 45, 703–714.
(13) Yang, X.; Neta, P.; Stein, S. E. Anal. Chem. 2014, 86, 6393–6400.
(14) Tsugawa, H.; Kind, T.; Nakabayashi, R.; Yukihira, D.; Tanaka, W.; Cajka, T.; Saito, K.; Fiehn, O.; Arita, M. Anal. Chem. 2016, 88, 7946–7958.
(15) Böcker, S.; Letzel, M. C.; Lipták, Z.; Pervukhin, A. Bioinformatics 2009, 25, 218–224.
(16) Ruttkies, C.; Schymanski, E. L.; Wolf, S.; Hollender, J.; Neumann, S. J. Cheminform. 2016, 8, 1–16.
(17) Allen, F.; Greiner, R.; Wishart, D. Metabolomics 2015, 11, 98–110.
(18) Dührkop, K.; Shen, H.; Meusel, M.; Rousu, J.; Böcker, S. Proc. Natl. Acad. Sci. 2015, 112, 12580–12585.
(19) Gilbert, A.; McElearney, K.; Kshirsagar, R.; Sinacore, M. S.; Ryll, T. Biotechnol. Prog. 2013, 29, 1519–1527.
(20) Martens, L.; Chambers, M.; Sturm, M.; Kessner, D.; Levander, F.; Shofstahl, J.; Tang, W. H.; Römpp, A.; Neumann, S.; Pizarro, A. D.; Montecchi-Palazzi, L.; Tasman, N.; Coleman, M.; Reisinger, F.; Souda, P.; Hermjakob, H.; Binz, P.-A.; Deutsch, E. W. Mol. Cell. Proteomics 2011, 10, R110.000133.
(21) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. Bioinformatics 2008, 24, 2534–2536.
(22) Benton, H. P.; Want, E. J.; Ebbels, T. M. D. Bioinformatics 2010, 26, 2488–2489.
(23) Kuhl, C.; Tautenhahn, R.; Böttcher, C.; Larson, T. R.; Neumann, S. Anal. Chem. 2012, 84, 283–289.
(24) Ogata, H.; Goto, S.; Sato, K.; Fujibuchi, W.; Bono, H.; Kanehisa, M. Nucleic Acids Res. 1999, 27, 29–34.
(25) Sridharan, G. V.; Choi, K.; Klemashevich, C.; Wu, C.; Prabakaran, D.; Pan, L. Bin; Steinmeyer, S.; Mueller, C.; Yousofshahi, M.; Alaniz, R. C.; Lee, K.; Jayaraman, A. Nat. Commun. 2014, 5, 5492.
(26) Robitaille, J.; Chen, J.; Jolicoeur, M. PLoS One 2015, 10, e0136815.
(27) Hefzi, H.; Ang, K. S.; Hanscho, M.; Bordbar, A.; Ruckerbauer, D.; Lakshmanan, M.; Orellana, C. A.; Baycin-Hizal, D.; Huang, Y.; Ley, D.; Martinez, V. S.; Kyriakopoulos, S.; Jiménez, N. E.; Zielinski, D. C.; Quek, L.-E.; Wulff, T.; Arnsdorf, J.; Li, S.; Lee, J. S.; Paglia, G.; Loira, N.; Spahn, P. N.; Pedersen, L. E.; Gutierrez, J. M.; King, Z. A.; Lund, A. M.; Nagarajan, H.; Thomas, A.; Abdel-Haleem, A. M.; Zanghellini, J.; Kildegaard, H. F.; Voldborg, B. G.; Gerdtzen, Z. P.; Betenbaugh, M. J.; Palsson, B. O.; Andersen, M. R.; Nielsen, L. K.; Borth, N.; Lee, D.; Lewis, N. E. Cell Syst. 2016, 3, 434–443.e8.
(28) Lewis, N. E.; Liu, X.; Li, Y.; Nagarajan, H.; Yerganian, G.; O’Brien, E.; Bordbar, A.; Roth, A. M.; Rosenbloom, J.; Bian, C.; Xie, M.; Chen, W.; Li, N.; Baycin-Hizal, D.; Latif, H.; Forster, J.; Betenbaugh, M. J.; Famili, I.; Xu, X.; Wang, J.; Palsson, B. O. Nat. Biotechnol. 2013, 31, 759–765.
(29) Kennedy, A. D.; Pappan, K. L.; Donti, T. R.; Evans, A. M.; Wulff, J. E.; Miller, L. A. D.; Reid Sutton, V.; Sun, Q.; Miller, M. J.; Elsea, S. H. Mol. Genet. Metab. 2017, 121, 83–90.
(30) Theriot, C. M.; Koenigsknecht, M. J.; Carlson, P. E.; Hatton, G. E.; Nelson, A. M.; Li, B.; Huffnagle, G. B.; Li, J. Z.; Young, V. B. Nat. Commun. 2014, 5, 3114.
(31) Evans, A. M.; DeHaven, C. D.; Barrett, T.; Mitchell, M.; Milgram, E. Anal. Chem. 2009, 81, 6656–6667.
(32) Li, S.; Park, Y.; Duraisingham, S.; Strobel, F. H.; Khan, N.; Soltow, Q. A.; Jones, D. P.; Pulendran, B. PLoS Comput. Biol. 2013, 9, e1003123.
(33) Tsugawa, H.; Cajka, T.; Kind, T.; Ma, Y.; Higgins, B.; Ikeda, K.; Kanazawa, M.; VanderGheynst, J.; Fiehn, O.; Arita, M. Nat. Methods 2015, 12, 523–526.
(34) Jeffryes, J. G.; Colastani, R. L.; Elbadawi-Sidhu, M.; Kind, T.; Niehaus, T. D.; Broadbelt, L. J.; Hanson, A. D.; Fiehn, O.; Tyo, K. E. J.; Henry, C. S. J. Cheminform. 2015, 7, 44.
(35) Yousofshahi, M.; Manteiga, S.; Wu, C.; Lee, K.; Hassoun, S. BMC Syst. Biol. 2015, 9, 94.
(36) Boyer, S.; Arnby, C. H.; Carlsson, L.; Smith, J.; Stein, V.; Glen, R. C. J. Chem. Inf. Model. 2007, 47, 583–590.
(37) Langille, M.; Zaneveld, J.; Caporaso, J. G.; McDonald, D.; Knights, D.; Reyes, J.; Clemente, J.; Burkepile, D.; Vega Thurber, R.; Knight, R.; Beiko, R.; Huttenhower, C. Nat. Biotechnol. 2013, 31, 814–821.
(38) Aßhauer, K. P.; Wemheuer, B.; Daniel, R.; Meinicke, P. Bioinformatics 2015, 31, 2882–2884.
TOC Figure