Article

The 1-SToP Approach to Annotation of LC-MS Metabolomics Data Corey David Broeckling, Andrea Ganna, Mark Christopher Layer, Kevin Brown, Ben Sutton, Erik Ingelsson, Graham Peers, and Jessica E. Prenni Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.6b02479 • Publication Date (Web): 25 Aug 2016 Downloaded from http://pubs.acs.org on August 29, 2016

Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Analytical Chemistry

The 1-SToP Approach to Annotation of LC-MS Metabolomics Data

Corey D. Broeckling*; Andrea Ganna; Mark Layer; Kevin Brown; Ben Sutton; Erik Ingelsson; Graham Peers; Jessica E. Prenni*

CDB: [email protected], ph: 970-491-2273; Proteomics and Metabolomics Facility, Colorado State University, C-121 Microbiology Building, 2021 Campus Delivery, Fort Collins, CO 80523.

AG: [email protected], ph: 617-643-3291; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard and Analytical and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA, 02114.

ML: [email protected], ph: 970-491-5117; Research Software Facility, Colorado State University, Fort Collins, CO 80523.

KB: [email protected], ph: 970-402-4013; Research Software Facility, Soil and Crop Sciences, Colorado State University, Fort Collins, CO 80523.

BS: [email protected], ph: 970-491-7846; Research Software Facility, Soil and Crop Sciences, Colorado State University, Fort Collins, CO 80523.

EI: [email protected], ph: 650-656-0089; Department of Medicine, Division of Cardiovascular Medicine, Stanford University School of Medicine, Stanford, California, USA, 94305.

GP: [email protected], ph: 970-491-6868; Department of Biology, Colorado State University, Fort Collins, CO 80523.

JEP: [email protected], ph: 970-491-0961; Proteomics and Metabolomics Facility, Colorado State University, C-121 Microbiology Building, 2021 Campus Delivery, Fort Collins, CO 80523.

* Author to whom correspondence should be addressed. [email protected]. 970-4912293.

Keywords: annotation, metabolomics, theoretical spectra, metabolite annotation, mass spectral libraries

ACS Paragon Plus Environment

Abstract

Liquid chromatography coupled to electrospray ionization mass spectrometry (LC-ESI-MS) is a versatile and robust platform for metabolomic analysis. However, while ESI is a soft ionization technique, in-source phenomena including multimerization, non-proton cation adduction, and in-source fragmentation complicate interpretation of MS data. Here, we report chromatographic and mass spectrometric behavior of 904 authentic standards collected under conditions identical to a typical non-targeted profiling experiment. The data illustrate that the often high level of complexity in MS spectra is likely to result in misinterpretation during the annotation phase of the experiment and a large overestimation of the number of compounds detected. However, our analysis of this MS spectral library data indicates that in-source phenomena are not random, but depend at least in part on chemical structure. These non-random patterns enabled predictions to be made as to which in-source signals are likely to be observed for a given compound. Using the authentic standard spectra as a training set, we modeled the in-source phenomena for all compounds in the Human Metabolome Database to generate a theoretical in-source spectrum and retention time library. A novel spectral similarity matching platform was developed to facilitate efficient spectral searching for non-targeted profiling applications. Taken together, this collection of experimental spectral data, predictive modeling, and informatic tools enables more efficient, reliable, and transparent metabolite annotation.

Introduction

Non-targeted metabolite profiling by liquid chromatography coupled to mass spectrometry (LC-MS) is a powerful analytical approach for investigating global changes in metabolism [1-3]. However, there remain significant limitations in our ability to annotate signals from LC-MS data, that is, to efficiently and effectively assign metabolite structures to the mass spectral features observed in the raw data. Traditional workflows for metabolite annotation involve an initial annotation of molecular features based on accurate mass measurements and/or predicted molecular formula. Ideally, putative metabolite annotations are confirmed by applying tandem mass spectrometry (MS/MS) to the feature(s) of interest followed by comparison to a database of MS/MS spectra and/or analysis of an authentic standard. This process is time-consuming and costly, and hence often infeasible or impractical. Together, these challenges have constrained the growth of metabolomics as a field and prevent full utilization of most metabolomics datasets, as unannotated compounds are difficult to interpret biologically.

In recent years, there has been significant progress in the development of chemical databases which can be used to support annotation and interpretation of metabolomics data. These databases typically contain thousands to millions of chemical structures. For example, the Human Metabolome Database (HMDB [4]) contains information on over 40,000 known and predicted human metabolites. Similarly, METLIN [5], ChEBI [6], MassBank [7], and LipidMaps [8] are invaluable resources that have significantly improved annotation accessibility and efficacy. To use metabolite structure databases in metabolomics annotation, one must first interpret the mass spectrum: the molecular weight of the analyte in question must be inferred from the spectral data before the search can be confidently performed. Some databases also provide experimental mass spectrometry data; these databases house MS/MS spectra of authentic compounds and enable direct comparison of MS/MS spectra from unknown compounds against spectra derived from authentic compounds. However, the size, complexity, and species-dependency of the metabolome and the limited availability and prohibitive cost of standards will limit the breadth of such MS/MS spectral libraries for the foreseeable future. Additionally, the acquisition of spectra from authentic reference standards is typically achieved via infusion of the standard compound in a neat solution coupled to precursor-ion-selected MS/MS fragmentation of the isolated [M+H]+ or [M-H]- ions (occasionally other mass adducts and neutral losses are selected) at multiple collision energies. Thus, the spectral libraries are acquired under different conditions than the initial stages of a metabolomics experiment. This discordance prevents effective use of MS/MS spectral libraries at the discovery stage of the experiment, as comprehensive MS/MS coverage is still not feasible [9]. Recent efforts toward MS/MS prediction [10] and direct searching of structure databases from MS/MS data [11] are highly encouraging, but throughput is often limiting, preventing comprehensive metabolome annotation in most studies.

To overcome the limitations described above, we generated an authentic standard spectral library run under experimental conditions identical to metabolomics samples, which contains both MS and indiscriminant MS/MS [12] data for 904 compounds. These data are mined and modeled to develop a novel approach to metabolite annotation requiring only MS and retention time data from the initial discovery dataset: we call this approach 1-SToP (annotation from MS1-Spectrum and Time Predictions). To enable broad access to the workflow, we then developed a spectral search tool to enable batch searching, interactive exploration, sharing, and exporting of metabolite annotations based on spectral matching. All elements of this workflow are freely available.

Methods:

Chemicals and Standards:

LC-MS grade methanol and water were purchased from Fisher. Formic acid and leucine enkephalin (lockmass) were purchased from Sigma. Nearly 800 of the standards in this library were acquired as part of the Human Metabolome Library through the Human Metabolome Database [2]. Prostaglandins were purchased from Cayman as part of a screening kit. Lipids were purchased through Avanti. Remaining standards were purchased primarily through Sigma.

Analytical conditions:

LC-MS data were acquired on a Waters Acquity UPLC system with a BEH C8 column running a water-methanol gradient, as previously described [3]. The LC was coupled via electrospray ionization to a Waters Xevo G2 TOF acquiring in positive MSe mode with a cone voltage of 30 V and a capillary voltage of 2.2 kV. Library data were collected over the course of about 1.5 years, so a certain amount of retention time instability is to be expected. In practice, this drift was observed to be less than 0.15 minutes.

Library Processing:

Raw data were processed using XCMS in the R environment. Signals were then filtered and grouped to ensure that only signals derived from the standards of interest were collected and compiled into the library. Custom R scripts were used to visualize and manually QC all spectra, which were then written to file in .msp format and converted to NIST database format using the lib2NIST tool from NIST. Details are provided in the supporting online materials.
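
As an illustration of the export step, the following Python sketch formats one spectrum as an MSP-style text record. It is not the authors' R code. The `Name:` and `Num Peaks:` fields followed by m/z and intensity pairs follow the commonly used MSP text layout; storing retention time in a `Comment:` field, the function name `format_msp_entry`, and the example peak values are assumptions made here for illustration only.

```python
def format_msp_entry(name, peaks, retention_time=None):
    """Return one MSP-style text record for a spectrum.

    name: compound name (string)
    peaks: list of (mz, intensity) tuples
    retention_time: optional retention time in seconds; writing it into a
        Comment field is an assumed convention, not the authors' documented one
    """
    lines = [f"Name: {name}"]
    if retention_time is not None:
        lines.append(f"Comment: RT={retention_time:.1f}")
    lines.append(f"Num Peaks: {len(peaks)}")
    for mz, intensity in peaks:
        lines.append(f"{mz:.4f} {intensity:.0f}")
    # A blank line separates consecutive records in a multi-spectrum file.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    # Hypothetical peak list for a single standard.
    record = format_msp_entry(
        "example standard",
        [(203.0526, 999), (181.0707, 120)],
        retention_time=42.0,
    )
    print(record, end="")
```

Records produced this way can be concatenated into a single text file, which is the kind of input a converter such as lib2NIST would then be expected to read.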

Library analysis and modeling of in-source phenomena:

Library spectra were summarized in R. The SMILES structure of each compound was processed using RCDK [11], the monoisotopic molecular mass was retrieved, and signals were collected for the in-source phenomena [M+H]+, [M+Na]+, [M+K]+, [2M+H]+, [2M+Na]+, [2M+K]+, [M+H-H2O]+, [M+H-2H2O]+, [M+H-NH3]+, [M+H-2H]+, [2M+H-H2O]+, [2M+H-2H]+, [M+H+Methanol]+, and [M+H+Acetonitrile]+. The absolute intensities were collected for each MS spectrum and then converted to relative intensities for all of the signals listed above, such that the total collected intensity summed to 1 for any given spectrum. SMILES representations of the chemical structures for all library spectra were collected and analyzed using the R package RCDK, with predicted chemical properties tabulated for all compounds in the library using all available chemical descriptors. This dataset was then parsed, keeping only those columns for which there were no missing values and for which there was a non-zero standard deviation. Random Forest models were fitted using the randomForest R package [12]. Cross-validation-based training control was used with n=5 and an mtry sequence of 1:50. The mtry value from the best model was used for fitting the final model. The models were trained on a randomly selected subset of 64% of the library. The remaining 36% was divided between a validation set (20%) and a final test set (16%). Model fit was evaluated based on the actual versus predicted values of the validation set only. The test set was reserved for final testing in RAMSearch. For prediction of in-source and retention time values for compounds from HMDB, all compound SMILES were downloaded and interpreted using RCDK as above, and the fitted models described above were applied. The resulting predicted values were then compiled into .msp and NIST database format as described above. Cosine similarity between actual and predicted intensities was calculated using the cosine function in the ‘sos’ package.
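
The bookkeeping in this section (expected m/z for each in-source signal, normalization of intensities to sum to 1, and cosine similarity between actual and predicted intensity vectors) can be sketched as follows. This is a minimal Python illustration, not the R code used to build the library. The mass shifts are standard monoisotopic values; the function names, the subset of adducts shown, and the example inputs are hypothetical.

```python
import math

# Monoisotopic mass shifts in Da. The ion values have the electron mass removed
# because the signals are singly charged cations.
PROTON = 1.007276
SODIUM_ION = 22.989221
POTASSIUM_ION = 38.963158
WATER = 18.010565
AMMONIA = 17.026549

# label -> (number of analyte molecules M, shift added to n * M).
# Only a subset of the signals listed in the text is included here.
ADDUCTS = {
    "[M+H]+": (1, PROTON),
    "[M+Na]+": (1, SODIUM_ION),
    "[M+K]+": (1, POTASSIUM_ION),
    "[2M+H]+": (2, PROTON),
    "[2M+Na]+": (2, SODIUM_ION),
    "[M+H-H2O]+": (1, PROTON - WATER),
    "[M+H-NH3]+": (1, PROTON - AMMONIA),
}


def adduct_mz(mono_mass, label):
    """Expected m/z of one in-source signal for a neutral monoisotopic mass."""
    n, shift = ADDUCTS[label]
    return n * mono_mass + shift


def relative_intensities(absolute):
    """Scale a dict of absolute intensities so that the values sum to 1."""
    total = sum(absolute.values())
    if total == 0:
        return {label: 0.0 for label in absolute}
    return {label: value / total for label, value in absolute.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    mass = 180.06339  # hypothetical analyte: a hexose, C6H12O6
    for label in ADDUCTS:
        print(label, round(adduct_mz(mass, label), 4))
    observed = relative_intensities({"[M+H]+": 200.0, "[M+Na]+": 600.0, "[M+K]+": 200.0})
    predicted = {"[M+H]+": 0.3, "[M+Na]+": 0.5, "[M+K]+": 0.2}
    labels = sorted(observed)
    print(cosine_similarity([observed[k] for k in labels], [predicted[k] for k in labels]))
```

Because cosine similarity depends only on the direction of the intensity vector, it is unaffected by whether intensities are expressed as absolute counts or as fractions summing to 1.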

RAMSearch:

RAMSearch is a graphical user interface written in C# which utilizes the NIST dll files and msPepSearch tools to search and retrieve spectra. A more detailed description of the program is provided in the supporting information online. Further, a short tutorial document and an executable version of the program are also available as supplementary material.

Final evaluation of RAMSearch against a theoretical HMDB library:

To test the validity of using a theoretical in-source and retention time library for annotation of LC-MS data at the MS1 level, we utilized all of the compounds from our reserved ‘test’ set (see above) which had matching structures in the theoretical HMDB in-source database. The 99 test set library spectra (surrogates for ‘unknown’ spectra) were then searched against the theoretical HMDB with a retention time tuning value set to 45 to account for the imprecision of the retention time prediction. The search result ranks were tabulated using three methods. The first was against the full database (with retention time and adduct and neutral loss intensities derived directly from the models). For the second comparison, we utilized the same library, but set the ‘retention time tuning value’ to 100,000 seconds, effectively setting all retention time similarities to 1. This results in the same comparison, with the exception that retention time has no selective contribution to the total score. For the final comparison, a new library was created, for which each adduct and neutral loss signal intensity was randomly sampled from the actual distribution of predicted values for that adduct/neutral loss event. In all cases, the combined similarity score was used for determining the rank of the correct match. The distribution of matches was plotted as a cumulative distribution plot in R.
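
The evaluation logic can be sketched in Python as shown below. The scoring function used by RAMSearch is described in the supporting information and is not reproduced here, so this sketch makes two explicit assumptions: retention time similarity is modeled as a Gaussian whose width is the tuning value in seconds, and the combined score is the product of the spectral and retention time similarities. All function names and example values are hypothetical. Under these assumptions, a very large tuning value drives every retention time similarity toward 1, which matches the behavior described for the second comparison.

```python
import math


def rt_similarity(rt_query, rt_library, tuning):
    """Gaussian retention time similarity (assumed form, tuning in seconds)."""
    delta = rt_query - rt_library
    return math.exp(-(delta ** 2) / (2.0 * tuning ** 2))


def combined_score(spectral_sim, rt_sim):
    """Combine the two similarities; the product is an assumption of this sketch."""
    return spectral_sim * rt_sim


def rank_of_correct(hits, correct_id):
    """Return the 1-based rank of the correct compound, or None if absent.

    hits: list of (compound_id, combined_score) tuples for one query spectrum.
    """
    ordered = sorted(hits, key=lambda hit: hit[1], reverse=True)
    for rank, (compound_id, _score) in enumerate(ordered, start=1):
        if compound_id == correct_id:
            return rank
    return None


def cumulative_fraction(ranks, max_rank):
    """Fraction of queries whose correct match ranks at or better than 1..max_rank."""
    n = len(ranks)
    return [
        sum(1 for r in ranks if r is not None and r <= k) / n
        for k in range(1, max_rank + 1)
    ]


if __name__ == "__main__":
    # One hypothetical query at 100 s compared with two library candidates.
    hits = [
        ("candidate_A", combined_score(0.90, rt_similarity(100.0, 130.0, 45.0))),
        ("candidate_B", combined_score(0.85, rt_similarity(100.0, 400.0, 45.0))),
    ]
    print(rank_of_correct(hits, "candidate_A"))
    print(cumulative_fraction([1, 3, 2, 1], max_rank=3))
```

Plotting the output of `cumulative_fraction` against rank for each of the three library variants would give a cumulative distribution comparison of the kind described above.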

Results and Discussion:

Description of the in-source spectral and retention time library:

UPLC-coupled time of flight MS data (positive ionization) were acquired for 904 authentic small molecule standards from a wide range of chemical classes including amino acids, aromatic acids, carbohydrates, dicarboxylic acids, steroids and steroid derivatives, hydroxy acids, alcohols and polyols, oxylipins, and fatty acids, representing both mammalian and plant metabolomes (see supplemental files, cmpd_summary.csv). For each standard, retention time, in-source (low energy) MS spectra, and idMS/MS [13] (indiscriminant MS/MS acquisition, high energy) spectra were collected under conditions identical to those used in a metabolomics experiment. This unique library enables confident metabolite annotation of common biological compounds without the need for follow-up MS/MS experiments. Additionally, this library provides an opportunity to study the behavior of authentic standards under experimental conditions, a process which can inform interpretation of experimental spectra from unknown structures.

The compounds in the library span a broad range of hydrophobicity (Figure S1a, median RCDK-predicted XLogP = 1.838) and molecular weight (Figure S1b, median molecular weight = 303.147). As reverse phase chromatography is largely a hydrophobicity-based separation, a range of retention times spanning the full chromatographic gradient (Figure S1c) is observed and the expected relationship between XLogP and retention time is apparent (Figure S1d). The outliers in Figure S1d largely represent two classes of compounds, porphyrins and glycerophosphoglycerol lipids, suggesting that deviations from simple hydrophobicity-based predictions are structurally driven. This library is provided as supplementary online material in msp (text) and NIST database formats to serve as a resource for the broader metabolomics community.

In-source spectral patterns reflect chemical structural properties:

To better understand in-source mass spectral phenomena, we summarized the spectral signals for each compound, tabulating the signal intensity of the common and expected adducts ([M+H]+, [M+Na]+, [M+K]+) as well as the less frequently considered ([M+H-2H2O]+, [M+H-ammonia]+, [2M+K]+, [M+H-2H]+…). While not an exhaustive list, these cover the most frequently observed phenomena. Examination of these signals revealed that the in-source spectral patterns observed were not random and frequently reflected compound structure. For example, for monoacylglycerol lipids, the base peak in-source ion was observed primarily as [M+Na]+, with a smaller intensity signal for [M+H]+ and a clear signal representing the loss of water [M+H-H2O]+ (Fig 1a, left panel). Extremely low intensities for potassium adducts were observed for all three monoacylglycerols. In contrast, the analogous fatty acids (Fig 1a, right panel, same chain lengths and double bonds) show a distinct pattern in which the in-source signals were dominated by [M+K]+. For the saturated fatty acids (FA(12:0), FA(16:0)), the [M+H]+ signal was rather strong, while for the unsaturated FA(18:2) (linoleic acid) standard, [M+H]+ was not observed and was instead replaced by a signal representing the loss of water [M+H-H2O]+. These patterns held across several compounds within these classes, the members of which differed in carbon chain length and level of desaturation (see full library, supplementary online data). Further, the patterns displayed by the monoacylglycerols are also represented by the diacylglycerols, suggesting that the low potassium affinity for the mono- and di-acylglycerols is due to the masking of the carboxylic acid group rather than the presence of hydroxyls.

These observations prompted a more global visualization to explore whether distinctive patterns were observed for other classes of compounds. The normalized and log10-scaled intensities of these signals are displayed as a heatmap (Figure 1b, high-resolution, fully labelled version as Figure S2), with bidimensional hierarchical clustering. This display clearly indicates that there are many compounds for which there is essentially no detected [M+H]+ signal (black bars), that there are distinct ionization patterns visible, and that these patterns appear at least partially dependent on the structures of the compounds, as similar classes of compounds cluster together in this display (Figure S2). Dimers are frequently encountered, not only as protonated adducts (twoMH), but as sodiated (twoMNa) and potassiated (twoMK) dimers. Generally, when a monomeric [M+Na]+ or [M+K]+ was observed, we were more likely to also observe the dimeric metal adduct signal (monomer to dimer Spearman's rank ρ = 0.34, 0.61, 0.67 for H, Na, and K, respectively; all p 0.8, indicating an aggregate accuracy in the model predictions which exceeds the predictive power for individual peak predictions. If we compare this result to cosine similarities from randomly paired spectral intensities (actual vs predicted), the distribution shifts toward a bimodal distribution with densities near cosine similarity values of 0.9 and 0.1. The density at high similarities is driven by compounds of a similar structure which also possess similar adduct/neutral loss patterns (Figure 2b). This apparent lack of specificity actually indicates that chemical structure impacts in-source patterns, and therefore randomly paired spectral intensities from similar structures will have high spectral cosine similarity. Despite the bimodal distribution of randomly paired spectra, authentic cosine similarities were significantly higher than randomly paired similarities (t-test p