Subscriber access provided by UB + Fachbibliothek Chemie | (FU-Bibliothekssystem)
Article
A pattern recognition based approach for identifying metabolites in NMR based metabolomics Abhinav Dubey, Annapoorni Rangarajan, Debnath Pal, and Hanudatta Sastry Atreya Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.5b00990 • Publication Date (Web): 23 Jun 2015 Downloaded from http://pubs.acs.org on June 25, 2015
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
A pattern recognition based approach for identifying metabolites in NMR based metabolomics Abhinav Dubey1, 2, Annapoorni Rangarajan3, Debnath Pal1, 4 *, Hanudatta S. Atreya2, 5 * 1
IISc Mathematics Initiative, Indian Institute of Science, Bangalore 560012, India
2
NMR Research Centre, Indian Institute of Science, Bangalore 560012, India
3
Molecular Reproduction, Development and Genetics, Indian Institute of Science, Bangalore 560012, India 4
Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore 560012, India 5
Solid State and Structural Chemistry Unit, Indian Institute of Science, Bangalore 560012, India
* Corresponding authors: Debnath Pal [
[email protected]] Hanudatta S. Atreya [
[email protected]]
1 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 2 of 27
Abstract Identification and assignments of metabolites is an important step in metabolomics and is necessary for the discovery of new biomarkers. In nuclear magnetic resonance (NMR) spectroscopy based studies, the conventional approach involves a database search, wherein chemical shifts are assigned to specific metabolites using a tolerance limit. This is inefficient because deviation in the chemical shifts associated with pH or temperature variations as well as missing peaks impairs a robust comparison with the database. We propose here a novel method which is based on matching the pattern of peaks rather than absolute tolerance thresholds, using a combination of Geometric Hashing and similarity scoring techniques. Tests using 719 metabolites from the Human Metabolome Database (HMDB) show that 100% of the metabolites can be assigned correctly when accurate data is available. A high success rate is obtained even in presence of large chemical shift deviations such as 0.5 ppm in 1H and 3 ppm in 13C and missing peaks (up to 50%), compared to nearly no assignments obtained under these conditions with existing methods which employ a direct database search approach. The method was evaluated on experimental data on a mixture of 16 metabolites at eight different combinations of pH and temperature conditions. The pattern recognition approach thus helps in identification and assignment of metabolites independent of the pH, temperature and ionic strengths used, thereby obviating the need for spectral calibration using internal or external standards.
2 ACS Paragon Plus Environment
Page 3 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
Introduction In recent years, metabolomics has emerged as an important tool in a wide range of studies, such as systems biology, drug discovery and pharmacokinetics1-6. Most metabolomics studies have goals of identifying and/or quantifying the metabolites present in a complex mixture. While identification of metabolites allows a direct mapping and identification of related biological pathways, quantification helps in understanding their prevalent flux in the studied samples. Two major analytical techniques, namely Nuclear Magnetic Resonance (NMR) and Mass spectrometry (MS) are employed for this purpose. Identification of metabolites in NMR typically involves methods like spiking, spectral deconvolution and/or multidimensional NMR experiments7,8. The spectral data can be searched against several databases such as HMDB (Human Metabolome Database)9, BMRB (Biological Magnetic Resonance Bank)10, MMCD (Madison Metabolomics Consortium Database)11, MetaboMiner12, COLMAR (Complex Mixture Analysis by NMR)13 that are archiving repositories of chemical shifts for numerous metabolites. A one dimensional (1D) NMR spectrum is unsuitable for assignment of metabolites for complex mixture in metabolomics due to frequent overlap of peaks. Efficient assignment of resonances in 1D NMR mandates assistance from 2D NMR experiments15. This is because 2D/3D NMR spectrum resolves overlapping peaks by dispersing them in the indirect dimensions14. Thus acquiring 2D NMR spectrum in addition to 1D is gaining importance in the field of NMR-based metabolomics8,16. Popular NMR experiments such 2D [13C, 1H] heteronuclear single quantim correlation (HSQC) are commonly leveraged for such studies owing to its advantage of having high chemical shift dispersion of
13
C in its indirect dimension and
13
C 1H chemical shifts
correlation. The chemical shifts obtained therein can be assigned to metabolites by querying against archived peak lists stored in databases such as HMDB, BMRB, MMCD, MetaboMiner 3 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 4 of 27
and COLMAR. These databases use their own query algorithms, based on matching peaks within a user defined tolerance limits, along with other parameters like ‘uniqueness’ of peak (MetaboMiner)12 and ‘assignment uniqueness value’ (COLMAR)13. This poses a major shortcoming for practical purposes, as experimental conditions such as pH, temperature etc., at which NMR spectra are acquired by the user are almost never identical to the “standard” conditions at which the metabolite data are acquired for the databases. Therefore, all users are expected to calibrate their NMR experiments to “standard” conditions of the database to facilitate efficient match between experimental and archived data, based on tolerance and related search parameters. The nature and extent of the spectral variations posed by external conditions on NMR experiment can be daunting. While the spectra of pure metabolites in database are generally recorded under “standard” conditions of pH = 7 and temperature = 298 K, the deviation in the observed chemical shift values in actual experiments from the database can have both systematic and non-systematic components. It is instructive to note that chemical shifts deviations from database values due to pH variations are quite common in metabolomics sample from cancer cells17 (pH varies from 5.8 to 7.6), urine18 and blood serum15. The other major difficulty is posed by missing peaks in NMR spectrum, which render a spin system incomplete resulting in ambiguous assignments when comparing it with spin systems in the database13. Systematic deviations can be taken care by calibrating the entire spectrum, which requires prior knowledge of chemical shift of internal reference compound or addition of an external compound for referencing. However, calibration becomes difficult for spectra recorded at low resolution in the indirect dimension of a 2D spectrum. Further, addition of external reference is at times undesirable because their chemical shifts are susceptible to change at low pH and/or may have 4 ACS Paragon Plus Environment
Page 5 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
possible interactions with other solute molecules19,20. Non-systematic shifts are more difficult to address. These deviations arise from variations in pH change, temperature change, NMR instruments, ionic strength of the solvent and errors in peak picking. Indeed to address this problem, Chikayama et al. have suggested creation of an in house database as a remedy21. A glimpse on the extent of the systematic and non-systematic deviation of observed chemical shifts from those in the database can be seen from Figure 1a. The figure depicting 2D [13C, 1H] HSQC spectrum of 16 metabolites mixture recorded at pH 7 and temperature 298K clearly differentiates the experimentally observed chemical shifts from those recorded in the HMDB repository. The chemicals shifts in HMDB are almost 3 ppm off in the 13C and 0.05 ppm off in the1H dimension compared to our recorded spectrum. The difference is to be noted with the background that both our spectrum and that archived in HMDB are supposedly recorded at the identical “standard” conditions. Figure 1b further highlights how these deviations are unevenly distributed with variation of temperature and pH. If we use wider tolerance limits of 0.12 ppm in 1H and 3 ppm in
13
C as implied by the distribution in Figure 1b to estimate how
many peaks each of the 6587 chemical shift values archived in HMDB database would have in its tolerance neighborhood, we find a remarkably high two thirds of chemical shifts with more than 100 peaks in their neighborhood (Figure 1c). This suggests that increasing the tolerance limit to accommodate the systematic and non-systematic shifts may increase the scan coverage, but on the other hand it diminishes the chance of obtaining the correct assignment of metabolites due to significantly larger number of contesting false candidates. This highlights the major lacunae of using a chemical shift tolerance-based approach to search metabolites in databases for NMR metabolomics.
5 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 6 of 27
Figure 1: (a) Comparison of spectra acquired on a 700 MHz NMR spectrometer for a mixture of 16 metabolites (red contours), with the HMDB chemical shifts (marked as + sign with single letter amino acid code). (b) Distribution of deviations of 1H and 13C chemical shifts of 16 metabolites at temperature 298 K and 308 K and at pH 4,5,6,7 and 8 from the values stored in HMDB. (c) Distribution of number of peaks present in the neighborhood of an individual peak in HMDB within the tolerance radius of 0.53 ppm and 3.3 ppm in 1H and 13C dimension, respectively.
We propose here a novel database search approach which addresses the aforementioned problems and takes care of both systematic and non-systematic deviations in the observed
6 ACS Paragon Plus Environment
Page 7 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
chemical shifts of metabolites relative to that in the database, and is also robust to missing peaks. We call this method as PROMEB (Pattern RecOgnition based assignment for MEtaBolomics). This approach leverages the mathematical technique of Geometric Hashing, which is extensively used in the area of image processing22,23. The idea is to take the archived data (peak list) corresponding to each metabolite from the database and generate patterns using the Geometric Hashing technique. Subsequently, when a spin system observed in the NMR spectra is taken as query, its peak list is converted to a set of patterns using Geometric Hashing and matched with the database derived patterns. The matching metabolites are then ranked in priority using voting, peak ratio thresholding and similarity scoring. The spin systems can be derived from any kind of NMR spectrum (2D/3D 1H/13C Correlated spectroscopy (COSY), Total Correlation Spectroscopy (TOCSY) etc.), as long as the corresponding type of chemical shift data is available for matching in the archived form in the database. We demonstrate our method by applying Geometric Hashing to obtain the peak pattern of metabolites in the 2D [13C, 1H] HSQC spectrum to compare with the peak pattern of metabolite present in the HMDB database. The basic premise of our method rests on the fact that even though the peak positions of metabolites in actual experiments may have shifted beyond the set tolerance level, the possibility that the peak pattern as a whole still carries the information about the metabolite is very high. We demonstrate the robust performance of our method in two ways: (i) by testing the recall efficiency of 719 metabolites in the Human Metabolome Database under various search and simulated data conditions, such as 050% peak deletion and addition of noise in form of chemical shift deviations and (ii) test of recall using real query spin systems from experimentally acquired 2D [13C, 1H] HSQC spectra acquired for a sample containing a mixture of 16 metabolites at four different pH and two
7 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 8 of 27
distinct temperature values. We acquired a 2D [13C, 1H] HSQC-TOCSY spectrum to establish the spin systems. Theory Geometric Hashing as basis for PROMEB Every NMR active nuclei in a metabolite gives a resonance peak in a NMR spectrum and the collection of such peaks coming from given metabolite constitute a spin system. Thus the terms 'metabolite' and 'spin system' are used interchangeably. This collection of peaks from a spin system presents a pattern that may be unique to the metabolite. In situations where peak(s) in a spin system is not fully detected in spectrum, the uniqueness of the pattern may be lost. The primary motivation of using Geometric Hashing is to suitably process such peak patterns corresponding to archived metabolites, to match the same derived from the query NMR spectra. Our method can be divided into two stages - a preparatory stage and an evaluation stage. This is illustrated in Figure 2 and explained below in detail. The preparatory stage involves creating a chemical shift pattern of each metabolite using the concept of Geometric Hashing and the chemical shifts available for that metabolite in the HMDB and embedding it in the hash table. In general, a hash table has unique keys in first column pointing to data in subsequent columns for that row. We next explain the creation of hash table with keys and pointing to metabolites’ name as data (Figure 2a, 2b). Normally, all the metabolite data from database would be preprocessed only once and stored in the hash table for recurring use during query. In contrast, the hash table for the query spin system would be generated on the fly during search.
8 ACS Paragon Plus Environment
Page 9 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
Figure 2: (a) Schematic of Geometric Hashing algorithm applied to our particular problem. (i) An example metabolite 1 (m1) with three peaks (A, B, C) from its spin system is used to demonstrate the steps in ii, iii and iv. In each of these steps, two peaks are chosen as basis and suitably scaled and aligned to the frame of reference (-1, 0) and (1, 0) so that the transformed third peak may be written in the new basis. The new coordinates of the third peak forms a 'key' in the hash table. The pairs of peaks are chosen in turn and all the three peaks are converted to keys. Thus a metabolite with ‘n’ peaks will give rise to maximum of 1 2⁄2 keys in the hash table. (b) An example hash table with keys generated from peak pattern of metabolite 1 (m1) using Geometric Hashing is shown. (c) Schematic of automated assignment method which needs the spin system input by the user and compares the chemical shift pattern in the spin system with the metabolites in the database. In preparatory stage, peaks from query metabolite and target metabolites are subjected to Geometric Hashing algorithm to build hash tables to facilitate geometric pattern search. The hash table from the database is computed once and stored. The hash tables derived from the query and the database are used as input to the
9 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 10 of 27
evaluation stage. The mutually exclusive series of filtering steps of vote threshold, and PR score (≥ 1) is put to use for screening and similarity score for ranking the target metabolites.
There are several steps involved in creating the hash table. For a given metabolite or spin system, two peaks are chosen at a time as the 'basis'. The distance between these two peaks is normalized to 2 ppm and all other peaks are reassigned the new transformed chemical shift coordinates in this new basis system (Figure 2a). The first operation cancels the differences caused by any affine transformations during later comparisons in the evaluation stage. The second operation takes care of scaling. Scaling is enlargement or diminishing of pattern. In the context of present method, adjusting scales helps to overcome non-systematic deviations. The new coordinates of peak are quantized by rounding to one decimal place and this is termed as a 'key' in the hash table. We performed test simulations and found that rounding to one decimal place is optimal. Rounding to two decimal places creates a large number of keys with no added advantage. On the other hand, upon rounding to nearest integer and in presence of large deviations, keys frequently do not find corresponding match in the database. The 'key' is meant to point to the spin system or the corresponding metabolite (Figure 2b). In reality these are index values that point to set of spin systems or metabolites. Thus a spin system/metabolite with ‘n’ peaks will give rise to maximum of 1 2⁄2 keys in the hash table. The hash table captures the peak pattern of all the metabolites in the database. Thus, the absolute chemical shifts of individual atoms of the spin system/metabolite are no longer used for assignments, but instead the peak pattern of the entire spin system/metabolite is used for matching. Two spin systems/metabolites can share same key in which case the key will point to both (Figure 2c). Since two peaks are used to define a basis, this eliminates metabolites/spin systems with less than three peaks from Geometric Hashing, rendering them unsearchable. 10 ACS Paragon Plus Environment
Page 11 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
After indexing the input spin system and metabolites from both the query and the database, respectively, the available patterns in the hash tables are evaluated in three sub stages (Figure 2c). Evaluation Stage 1 (Vote threshold): As seen above, each individual metabolite consists of a set of keys encoding its peak pattern in the hash table. In this stage of evaluation, the query keys are matched against the keys of each metabolites and the percentage of keys of each metabolite that match with that of the query is denoted as a "vote percentage" (i.e., percentage calculated with respect to the keys a given metabolite has in the database). At the end of comparison a list of metabolites with percentage of votes (or match) above a threshold is created. For example, consider a target metabolite which has 4 peaks. Based on formula 1 2⁄2 it will have 12 keys in the hash table. If the keys of query metabolite match 8 of these keys (out of 12) of target metabolite, the target metabolite gets 66.67% votes. The target metabolites which have vote percentage below some user defined threshold are filtered out. The vote threshold should be chosen so as to accommodate the reduction in vote percentage caused by missing peaks. The percentage reduction in vote with percentage peaks missing for HMDB metabolites is shown in Figure S1.
Evaluation Stage 2 (Peak Number Ratio/PR Score): Very often due to missing peaks, the number of chemical shifts observed for a metabolite in the NMR spectrum will be less than or equal to that available in the database. This implies that target metabolite (in database) should have more than or equal to the number of keys for the metabolite being queried. For every metabolite which is obtained as a 'hit' for the query metabolite, we compute the ratio of number
11 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 12 of 27
of keys in target metabolite to the number of keys in query metabolite. This ratio of number of keys (denoted as PR score) should be ≥ 1, which is used as a threshold for further short listing the candidates. Thus, metabolites from the database which match the query metabolite peak pattern (embedded in the hash table), but with number of keys less than that observed in the experimentally observed spin system (query) are discarded from possible assignments. Evaluation Stage 3 (Similarity score): Geometric Hashing compares only the peak pattern and is indifferent to the location of this pattern in the spectrum. For example, two or more metabolites in the database may share the same peak pattern with the query spin system, but they may differ widely in their chemical shift values. Therefore, to finally rank the screened metabolites based on peak’s location, we compute similarity score between query and target metabolite based on their chemical shifts. We heuristically calculate the shortest Euclidean chemical shift distance between the query metabolite and all the possible shortlisted metabolite candidates to obtain a similarity score (Text S1):
…. Equation (1) where 'i' is the index number of chemical shifts being matched and δH and δC are chemical shifts of 1H and is 13C, respectively.
Preparatory stage and Evaluation stage, partitioned into three sub stages are schematically shown in Figure 2c. In summary, once a set of metabolites are identified which match the chemical shift pattern of the query spin system, their chemical shifts are used for ranking/prioritizing the identified metabolites. Similarity score alone would lead to inefficient ranking of correct metabolite and the two filtering steps play essential role in removing undesired contestants as demonstrated with an example in Figure S2. 12 ACS Paragon Plus Environment
Page 13 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
Materials and Methods Benchmarking To test our method, we used chemical shifts of 846 metabolites present in HMDB as the primary source of information. First a dataset was created by extracting the 2D 1H and
13
C
chemical shift correlations of all the 846 metabolites. We generated six test datasets (A-F) from this extracted data. Dataset A had noise in form of systematic and non-systematic deviations added to the chemical shifts. To simulate systematic deviation, the offset of 0.5 ppm and 3 ppm was added to all 1H and 13C chemical shift values, respectively. For non-systematic deviation, a random number chosen between (0.01, 0.03) ppm and (0.1, 0.3) ppm was added/subtracted to 1H and 13C chemical shifts, respectively. Besides having these form of noise, 10%, 20%, 30%, 40% and 50% of peaks were deleted randomly from datasets B-F. These datasets (A-F) formed our test cases against the original HMDB chemical shift values for a metabolite. Each dataset was regenerated 40 times through random addition of noise, and used for testing. The metabolites were further grouped into 93 classes (Table S1) based on the annotations present in HMDB. The HMDB version 3.6 files were downloaded for analysis. NMR Experiments A mixture of 16 metabolites (Table S2) was prepared at different pH buffers namely 4, 5, 6, 7 and 8. We used 2D [13C, 1H] HSQC TOCSY experiment to identify the spin systems; however, it must be emphasized that our method takes 1H and
13
C chemical shifts of a spin
system obtainable from any set of NMR experiments as input. The 2D [13C, 1H] HSQC and 2D [13C, 1H] HSQC TOCSY were acquired at natural abundance of
13
C for these metabolites
mixture at temperatures 298 K and 308 K. The 2D [13C, 1H] HSQC was recorded on a Bruker 13 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 14 of 27
Avance NMR spectrometer operating at 1H resonance frequency of 700 MHz and equipped with a cryogenic probe. The spectra were recorded with 8 scans and 256 complex points in indirect 13
C dimension and 2048 points in the direct 1H dimension. Spectral width (or spectral range,
used interchangeably) was 100 ppm with carrier frequency at 50 ppm was chosen for
13
C
dimension and 12 ppm with carrier frequency at 4.7 ppm for 1H dimension. The spectra were first zero filled to twice the number of points; cosine window function was used for apodization. The spectra were Fourier transformed followed by phase and baseline correction. The 2D [13C, 1
H] HSQC TOCSY was recorded on the same NMR spectrometer. All parameters were same as
those of 2D [13C, 1H] HSQC. Manual peak picking was carried out for each spectrum, but automatically picked peaks can also be used as input for this set of two spectra.
Results and Discussions Various factors can cause drift in chemical shifts of which pH and temperature are the most common ones. Figure 1b above illustrates the shifts with respect to data in the HMDB; similar deviations are observed for the same set of metabolites at different pH and temperature conditions relative to standard pH at 7 and ambient temperature of 298 K (Figure S3). It is interesting to note that there is a consistent drift of the chemical shifts when the temperature is increased to 308 K; however, the shifts associated to pH are not consistent. For example, the mean deviations for pH 4 and pH 5 are >0 and < 0, respectively. Therefore, the data suggests that varying pH and temperature can result in both systematic and non-systematic shifts. The wide fluctuation re-emphasizes the need for caution when using peak tolerance as the sole basis
14 ACS Paragon Plus Environment
Page 15 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
for metabolite identification. Use of peak pattern eliminates the said drawback and we used the above data to demonstrate the same below. The number of metabolites in HMDB currently eligible for screening (i.e., having >2 peaks) based on our approach is 719 (86% of the metabolites in the HMDB) for 200 ppm spectral width in
13
C dimension and 10 ppm spectral width in 1H and it reduces to 425
metabolites if 50% of peaks are deleted randomly in the simulations to mimic missing peaks (Table S3). As control we tested our method with chemical shifts of 719 metabolites without adding any type of noise; our method 100% correctly recalled all metabolites at rank 1. Next, the efficiency of the above screening procedure was tested for correct recall on metabolites in the HMDB after adding systematic and random noise, and various levels of peak deletions (as described in Materials and methods section). In other words, we created noisy spin systems from 2D [13C, 1H] shift correlations of all 719 metabolites. The results are shown in Figure 3. The percentage of correct assignments (that is the correct metabolite being identified within top 10 rank) is > 75% in all cases. Even in the case of 50% deletions of peaks and in presence of large non-systematic deviations we were able to correctly identify 60% of the metabolites within rank 10. We also defined class rank as the rank corresponding to the lowest ranked target metabolite from the same class. Based on this definition, we found that 72% to 78% of the metabolites are correctly assigned class within rank 5 (Figure 3). The method assigns more than 85% of the metabolites correctly within rank 30 and class based assignment is 92% correct. Note
15 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 16 of 27
that these percentages are calculated with respect to metabolites eligible for our method (i.e., have more than 2 peaks), the values for which are different in each case as given in Table S3. We next compared this with identification of metabolites based on the presence of unique/signature metabolite peak, an approach used in software like MetaboMiner12 and COLMAR13. Unique peak of a metabolite is one which has no other peak from other known metabolites within a given tolerance limit. These unique peaks can be the signature peak for the metabolite which assists in assignment24.
16 ACS Paragon Plus Environment
Page 17 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
Figure 3: Plots showing the fraction of the metabolite classes (N=93) and individual metabolites correctly screened at 10%,20%,30%,40% and 50% peak deletions after addition of random noise. Each metabolite was scanned with chemical shifts within spectral width 200 ppm in 13C dimension and 10 ppm in 1H dimension. The curve shows cumulative percentage (and standard deviation) of metabolites found within the specified rank. The blue and the red line correspond to correct identification of the query metabolite, and its class, respectively.
The introduction of systematic deviation of 0.5 ppm and 3 ppm with non-systematic deviation added in terms of random noise between ± (0.01, 0.03) ppm and ± (0.1, 0.3) ppm for 1
H and 13C chemical shifts, respectively, coupled with peak deletion leaves only about 1% of the
metabolites in HMDB with any unique peaks; 1.5% being the maximum number of cases when no peaks are deleted (Figure 4). Therefore, a normal peak tolerance based search method intending to exploit uniqueness criteria has 2) under various peak deletions. Exact number of metabolites for each case is listed in Table S3.
17 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 18 of 27
The reasonably high recall (20-80%) of metabolites when only one percent has unique peaks under simulated noise and peak deletion conditions demonstrates the true power of our method. The success rate is fairly well sustained even during higher peak deletions, though large numbers of metabolites are rendered ineligible for screening due to the mandatory >2 peaks criterion. It is to be noted that a metabolite can be missed from getting any rank either because it has less than 3 peaks or if it could not cross vote threshold. The percentage of metabolites missed; i.e., which cannot be shortlisted at the first place through vote threshold, increases with number of peaks deleted (Figure S4). This is due to the dramatic decrease in vote percentage for which it is expected to be shortlisted. The percentage is calculated with respect to only those metabolites which have more than 2 peaks. Thus the normalizing number decreases with increasing percentage of peaks deletion. If we look closely into our dataset, the systematic and non-systematic deviations together account for 0.53 ppm deviation in 1H and 3.3 ppm deviation in
13
C chemical shifts which is at par with the deviations expected from practical situations
(Figure 1b). In these circumstances, we need to fix the tolerance limit up to 0.53 ppm and 3.3 ppm in 1H and 13C dimension, respectively for labeling a peak as a unique/signature peak. Since, only 1.5% of HMDB metabolites have a unique peak in these limits, this appears to be the most likely threshold for current metabolite assignments using unique peaks for practical NMR metabolomics. In contrast, our method PROMEB with the same dataset is able to assign 46% metabolites within rank 5 and 80% within rank 30 because it uses relative patterns of peaks associated with metabolites, instead of the absolute peak positions. The latter is indeed a large reduction in metabolites search space (from 719 molecules to 30) with reasonably high accuracy, making it suitable for further manual intervention. Using visualization software like CCPN25,26, CARA27, and TOPSPIN, the peaks in spectra can be manually checked in the order of their rank
18 ACS Paragon Plus Environment
Page 19 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
of the molecules allowing for informed identification and assignment of metabolites. It is instructive to note that our test designs and simulations were carried out to assess how PROMEB performs in the worst case scenarios, and how it can still be a powerful and useful tool to a user for whom primary biological sample and the derived spectral data is most precious. With availability of accurate data, our method is able to return accurate results. Experimental test case The above method was applied to data acquired for 16 metabolites at pH values between 4 and 8, and two different temperatures (298 K, 308 K) (Figure 1a, 5a, 5b). Four metabolites had only two or less peaks and therefore did not qualify for screening by our method. But these metabolites were useful for testing the case if extra peaks in the spectrum hamper the assignments. For the remaining 12 metabolites, 73% are correctly identified within rank 5. About 20% of metabolites were identified between rank 6 and 10. We were also able to assign Glutamine and Glutamic acid which are known for overlapping peaks because of identifying spin systems using 2D [13C, 1H] HSQC-TOCSY. Glucose is the only metabolite which could not be assigned within top ten ranks. Metabolites assigned lower ranks to glucose are mostly glucose derivatives or some other carbohydrates; however class wise it gets a rank of 1 (Figure 5c). PROMEB vs HMDB HMDB webserver takes peak list and tolerance levels in both the dimension as input. We compared the performance of PROMEB with HMDB webserver. A tolerance value of ±0.05 ppm and ±0.5 ppm in 1H and
13
C dimensions, respectively, was used. Figure 5c displays the
search result. In column A and B, the chemical shifts values obtained from spectra are directly used. We found HMDB webserver (column B) could not find any correct metabolite while 19 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 20 of 27
PROMEB (column A) performance is distinctly better. In column C the chemical shifts are calibrated for pH 7 and temperature 298 K to match with database values and searched using HMDB webserver. The performance of HMDB matches with PROMEB for 298 K but again it could not identify the correct metabolites for data acquired at 308 K. We also used 100 randomly chosen metabolites from one of the 40 simulated dataset C (described before). After adjusting the chemical (large shift in spectrum possibly due to miscalibration in equipment) in 13C dimension and using the general tolerance values, almost no metabolite can be identified using the HMDB webserver (Table S4). Thus PROMEB works better on uncalibrated (and even miscalibrated) spectra and does not need user discretion for choosing tolerance. On the other hand HMDB webserver requires calibrated spectra and user defined tolerance values.
20 ACS Paragon Plus Environment
Page 21 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
Figure 5: Experimentally acquired (a) 2D[13C, 1H] HSQC and (b) 2D [13C, 1H] HSQC TOCSY spectrum of a mixture of 16 metabolites at natural abundance of 13C at pH 7 and temperature 298 K. (c) Results of assignment of 16 metabolites mixture (NMR spectra recorded at different pH and temperature) using the method proposed. 12 metabolites have more than two peaks, suitable for identification using this method. Column A: Identification using PROMEB without calibration of peaks. Column B: Identification using HMDB webserver without calibration of peaks. Column C: Identification using HMDB webserver after calibrating the peaks. Tolerance used for HMDB webserver is ±0.05 ppm and ±0.5 ppm in 1H and 13C dimensions, respectively. 21 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 22 of 27
The green circles indicate metabolites ranked within rank 5, blue triangles indicate metabolites within ranks between 5 and 10, golden squares indicate metabolites with rank greater than 10 and hence classified as incorrectly assigned. The red cross indicates the metabolite was not found in the result.
PROMEB has several advantages that are not currently leveraged in NMR based metabolomics. First, it is capable of providing a hit for a novel spin system which is not present in the database. To test this we created a dataset of novel metabolites. We randomly chose 30 metabolites (Table S5) and they were not indexed in the database hash table. Systematic and non-systematic noise was added as described previously with no peaks deleted. Using our method their chemical shifts were then queried against the rest 816 metabolites to identify their class. PROMEB could correctly assign class of 13% (4 out of 30 novel metabolites) at rank 1. It was able to correctly identify class of 60% novel metabolites within rank 5. As can be seen in Table S5 the rank received by novel metabolites does not correlate with number of metabolites in that particular class. This indicates that the class ranking is not affected by high or low representation of metabolites from the class. In such cases this method can contribute effectively in identifying the class of the novel metabolite. Second, our method is suitable for high throughput screening as each part of the computer program implementing the method can be parallelized, including Geometric Hashing. Third, the nature of the method is modular, so that we can also include additional screening stages, and also experimental data to boost the recall accuracy. Indeed more NMR experiments such as 2D [1H, 1H] TOCSY, 2D [1H, 1H] COSY can be recorded and the pattern of chemical shifts can be matched against the database values. Database with chemical shifts of metabolites from these experiments are very recently available28. Confidence in assignment of given spin system will increase on combining chemical
22 ACS Paragon Plus Environment
Page 23 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
shift pattern of spin system from several different experiments14. This will allow the method to improve and become powerful over time for larger and more complex NMR-based metabolomics studies. To best of our knowledge all methods of resonance assignment in metabolomics are based on tolerance limits. Generally some calibration is needed to the peaklist before submitting it as query to the existing methods. In our method we do not use any tolerance limit as a user parameter. Our method obviates the need of such calibration or an addition of external calibrating agent, which is undesirable at times19,20
Conclusions A new automated method for identification of metabolites (PROMEB) is proposed which utilizes the peak pattern of metabolites rather than their absolute chemical shift values and is robust to systematic and non-systematic deviations in chemical shifts from reference values present in the database. This method of assignment obviates the need of spectral calibration using internal or external standard(s). The basis of algorithm is such that evidences can be accumulated to increase the confidence in assignment. For example, spin system patterns of same metabolite from different NMR experiments (such as 2D 1H/13C COSY or TOCSY) can be compared and the metabolite ranking from all the experiments can be combined to increase the confidence level in identifying the metabolites. With the availability of this approach, the focus can shifted to enriching the public databases with more chemical shifts at standard conditions. The experiments can be done for any sample at any condition and information present in the
23 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 24 of 27
database can be readily used for assignments using PROMEB and thus sample specific databases are not needed. PROMEB can be freely downloaded from http://nrc.iisc.ernet.in/hsa/gft.htm
Acknowledgements AD is supported by the interdisciplinary program under the DST Centre for Mathematical Biology. The facilities provided by NMR Research Centre, supported by Department of Science and Technology (DST), India is gratefully acknowledged. HSA acknowledges support from DAE-BRNS research grant. DP gratefully acknowledges support from Department of Biotechnology (DBT) for computing facilities.
Supporting Information Available This information is available free of charge via the Internet at http://pubs.acs.org/.
24 ACS Paragon Plus Environment
Page 25 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
References
(1) Clayton, T. A.; Lindon, J. C.; Cloarec, O.; Antti, H.; Charuel, C.; Hanton, G.; Provost, J. P.; Le Net, J. L.; Baker, D.; Walley, R. J.; Everett, J. R.; Nicholson, J. K. Nature 2006, 440, 1073-1077. (2) Kell, D. B. Current opinion in microbiology 2004, 7, 296-307. (3) Kell, D. B. Drug discovery today 2006, 11, 1085-1092. (4) Saito, K.; Matsuda, F. Annual review of plant biology 2010, 61, 463-489. (5) Van Dien, S.; Schilling, C. H. Molecular systems biology 2006, 2, 2006 0035. (6) Weckwerth, W. Bioanalysis 2010, 2, 829-836. (7) Hao, J.; Liebeke, M.; Astle, W.; De Iorio, M.; Bundy, J. G.; Ebbels, T. M. Nature protocols 2014, 9, 1416-1427. (8) Guennec, A. L.; Giraudeau, P.; Caldarelli, S. Analytical chemistry 2014, 86, 5946-5954. (9) Wishart, D. S.; Jewison, T.; Guo, A. C.; Wilson, M.; Knox, C.; Liu, Y.; Djoumbou, Y.; Mandal, R.; Aziat, F.; Dong, E.; Bouatra, S.; Sinelnikov, I.; Arndt, D.; Xia, J.; Liu, P.; Yallou, F.; Bjorndahl, T.; Perez-Pineiro, R.; Eisner, R.; Allen, F.; Neveu, V.; Greiner, R.; Scalbert, A. Nucleic acids research 2013, 41, D801-807. (10) Ulrich, E. L.; Akutsu, H.; Doreleijers, J. F.; Harano, Y.; Ioannidis, Y. E.; Lin, J.; Livny, M.; Mading, S.; Maziuk, D.; Miller, Z.; Nakatani, E.; Schulte, C. F.; Tolmie, D. E.; Kent Wenger, R.; Yao, H.; Markley, J. L. Nucleic acids research 2008, 36, D402-408. (11) Cui, Q.; Lewis, I. A.; Hegeman, A. D.; Anderson, M. E.; Li, J.; Schulte, C. F.; Westler, W. M.; Eghbalnia, H. R.; Sussman, M. R.; Markley, J. L. Nature biotechnology 2008, 26, 162-164. (12) Xia, J.; Bjorndahl, T. C.; Tang, P.; Wishart, D. S. BMC bioinformatics 2008, 9, 507. (13) Bingol, K.; Li, D. W.; Bruschweiler-Li, L.; Cabrera, O. A.; Megraw, T.; Zhang, F.; Bruschweiler, R. ACS chemical biology 2015, 10, 452-459... (14) Pudakalakatti, S. M.; Dubey, A.; Jaipuria, G.; Shubhashree, U.; Adiga, S. K.; Moskau, D.; Atreya, H. S. Journal of biomolecular NMR 2014, 58, 165-173. 25 ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 26 of 27
(15) Nagana Gowda, G. A.; Gowda, Y. N.; Raftery, D. Analytical chemistry 2015, 87, 706-715. (16) Giraudeau, P. Magnetic resonance in chemistry 2014, 52, 259-272. (17) Tannock, I. F.; Rotin, D. Cancer research 1989, 49, 4373-4384. (18) Issaq, H. J.; Veenstra, T. D. Proteomic and Metabolomic Approaches to Biomarker Discovery; Elsevier Science, 2013. (19) Shimizu, A.; Ikeguchi, M.; Sugai, S. J Biomol NMR 1994, 4, 859-862. (20) Lam, Y.-F.; Kotowycz, G. FEBS Letters 1977, 78, 181-183. (21) Chikayama, E.; Suto, M.; Nishihara, T.; Shinozaki, K.; Kikuchi, J. PloS one 2008, 3, e3805. (22) Wolfson, H. J.; Rigoutsos, I. Ieee Comput Sci Eng 1997, 4, 10-21. (23) Wolfson, H. J. Computer Vision - Eccv 90 1990, 427, 526-536. (24) Chikayama, E.; Sekiyama, Y.; Okamoto, M.; Nakanishi, Y.; Tsuboi, Y.; Akiyama, K.; Saito, K.; Shinozaki, K.; Kikuchi, J. Analytical chemistry 2010, 82, 1653-1658. (25) Chignola, F.; Mari, S.; Stevens, T. J.; Fogh, R. H.; Mannella, V.; Boucher, W.; Musco, G. Bioinformatics 2011, 27, 885-886. (26) Vranken, W. F.; Boucher, W.; Stevens, T. J.; Fogh, R. H.; Pajon, A.; Llinas, M.; Ulrich, E. L.; Markley, J. L.; Ionides, J.; Laue, E. D. Proteins 2005, 59, 687-696. (27) Keller, R. {CARA: computer aided resonance assignment. http://cara.nmr.ch/}, 2004. (28) Bingol, K.; Bruschweiler-Li, L.; Li, D. W.; Bruschweiler, R. Analytical chemistry 2014, 86, 54945501.
26 ACS Paragon Plus Environment
Page 27 of 27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
For use of Table of contents/abstract graphic
27 ACS Paragon Plus Environment