Article
An Integrated Analytical and Statistical Two-Dimensional Spectroscopy Strategy for Metabolite Identification: Application to Dietary Biomarkers
Joram M. Posma, Isabel Garcia-Perez, James Christopher Heaton, Paula Burdisso, John C. Mathers, John Draper, Matthew R. Lewis, John C Lindon, Gary S. Frost, Elaine Holmes, and Jeremy Kirk Nicholson
Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.6b03324 • Publication Date (Web): 27 Feb 2017
Downloaded from http://pubs.acs.org on March 1, 2017
Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
JOURNAL Analytical Chemistry (ac-2016-03324s.R2)
TITLE An Integrated Analytical and Statistical Two-Dimensional Spectroscopy Strategy for Metabolite Identification: Application to Dietary Biomarkers
AUTHOR NAMES Joram M. Posma,†,⊥ Isabel Garcia-Perez,†,‡,⊥ James C. Heaton,†,∆ Paula Burdisso,†,# John C. Mathers,§ John Draper,ǁ Matt Lewis,†,∩ John C. Lindon,† Gary Frost,‡ Elaine Holmes,†,∩,* and Jeremy K. Nicholson†,∩,*
AUTHOR ADDRESS † Section of Biomolecular Medicine, Division of Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, South Kensington Campus, Imperial College London, London, SW7 2AZ, United Kingdom; ‡ Nutrition and Dietetic Research Group, Division of Diabetes, Endocrinology and Metabolism, Department of Medicine, Faculty of Medicine, Hammersmith Campus, Imperial College London, London, W12 0NN, United Kingdom; § Human Nutrition Research Centre, Institute of Cellular Medicine, Newcastle University, United Kingdom; ǁ Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, United Kingdom; ∩ MRC-NIHR National Phenome Centre, Department of Surgery and Cancer, Faculty of Medicine, Hammersmith Campus, Imperial College London, London, W12 0NN, United Kingdom
KEYWORDS Analytical Framework, Bayesian Analysis, Controlled Clinical Trial, Dietary Biomarkers, Food Challenge, Gaussian Mixture Model, Markov-Chain Monte-Carlo, Statistical Spectroscopy
ABSTRACT A major purpose of exploratory metabolic profiling is the identification of molecular species that are statistically associated with specific biological or medical outcomes; unfortunately, the structure elucidation of unknowns is often a major bottleneck in this process. We present here new holistic strategies that combine different statistical spectroscopic and analytical techniques to improve and simplify the process of metabolite identification. We exemplify these strategies using study data collected as part of a dietary intervention to improve health, which elicits a relatively subtle suite of changes from complex molecular profiles. We identify three new dietary biomarkers related to the consumption of peas (N-methyl nicotinic acid), apples (rhamnitol) and onions (N-acetyl-S-(1Z)-propenyl-cysteine-sulfoxide) that can be used to enhance dietary assessment and assess adherence to diet. As part of the strategy, we introduce a new probabilistic statistical spectroscopy tool, RED-STORM (Resolution EnhanceD SubseT Optimization by Reference Matching), that uses 2D J-resolved 1H-NMR spectra for enhanced information recovery using the Bayesian paradigm to extract a subset of spectra with similar spectral signatures to a reference. RED-STORM provided new information for subsequent experiments (e.g. 2D-NMR spectroscopy, Solid-Phase Extraction, Liquid Chromatography prefaced Mass Spectrometry) used to ultimately identify an unknown compound. In summary, we illustrate the benefit of acquiring J-resolved experiments alongside conventional 1D 1H-NMR as part of routine metabolic profiling in large datasets and show that application of complementary statistical and analytical techniques for the identification of unknown metabolites can be used to save valuable time and resources.
INTRODUCTION
Dietary interventions (DIs) are a cornerstone in reducing the risk of non-communicable diseases1-3 and promoting healthy aging4. However, understanding the response to dietary change is compromised by poor compliance with dietary recommendations and by the inherent inaccuracy of the available self-reporting dietary recording tools, with the prevalence of misreporting estimated at 30–88%5,6, lowering the value of such studies and data.
It has been demonstrated that dietary biomarkers can reflect consumption of specific foods and enhance dietary intake assessment at individual and population levels7-17. Dietary biomarkers are based on the concept that food intake is highly correlated with excretion levels of food-related compounds over a given period of time. These ‘biomarkers’ can be compounds that are excreted unchanged10,17 or that have undergone metabolic conversion, for example by gut bacteria8,11,13,14. Metabolic profiling of biofluids using spectroscopic technologies18 can detect thousands of compounds simultaneously, generating a profile that can be related to specific states of health or disease. Some of the compounds in these spectral profiles are potential biomarkers of dietary intake. However, finding associations between self-reported dietary intakes and excreted metabolites19,20 in order to discover potential food biomarkers is plagued by inaccuracies of self-reporting. Thus, confirmation of a candidate biomarker is best achieved by using a controlled food challenge with subsequent validation in a larger population or study cohort.17
The complementarity of the main workhorses of metabolic profiling, one-dimensional proton Nuclear Magnetic Resonance (1H-NMR) spectroscopy and hyphenated chromatographic-mass spectrometry (MS) techniques, has been extensively demonstrated in the last decade21-23. However, chemical characterization of molecular species associated with an outcome is still a limiting factor of exploratory metabolic profiling. NMR spectroscopy provides an atom-centered spectroscopic tool for structure elucidation that can be enhanced by statistical spectroscopic methods24,25 or by physical hyphenation with chromatographic methods such as Solid-Phase Extraction (SPE)26, Liquid Chromatography (LC)21 or by LC-NMR-MS21,27 to achieve a better chemical characterization of endogenous and exogenous metabolites.
Metabolite identification in 1H-NMR spectroscopy is aided by the intrinsic correlation of peaks from the same metabolite. Statistical TOtal Correlation SpectroscopY (STOCSY)28 makes use of this property by calculating the correlation between one spectral variable (driver) and all other variables to uncover structural associations. In cases where there are sufficient spectra, identification of 1H-NMR peaks using statistical methods is an efficient strategy that utilizes existing spectral data without requiring additional spectroscopic experiments a priori, which has obvious advantages in usage of volume-limited samples and is cost-efficient.23 Since the STOCSY method was published, many derivations have aimed to improve specific properties such as differentiation between structural and pathway correlations by clustering, subset selection or stoichiometric relationships.24
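The driver-based correlation at the heart of STOCSY can be illustrated with a minimal sketch. It assumes a matrix of spectra with one row per sample and one column per spectral variable; the function name `stocsy` and the synthetic data are hypothetical and are not taken from the published implementation.

```python
import numpy as np

def stocsy(X, driver_idx):
    """Pearson correlation between one driver variable and all variables.

    X is an (n_spectra, n_variables) array; the function name and the
    interface are illustrative assumptions, not the published code.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each spectral variable
    d = Xc[:, driver_idx]                   # centered driver variable
    cov = Xc.T @ d                          # unnormalized covariances
    norm = np.sqrt((Xc ** 2).sum(axis=0) * (d ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        r = np.where(norm > 0, cov / norm, 0.0)
    return r

# Synthetic example: variables 0 and 1 belong to the same "metabolite",
# variable 2 varies independently.
rng = np.random.default_rng(0)
conc = rng.uniform(1, 5, size=50)
other = rng.uniform(1, 5, size=50)
X = np.column_stack([conc, 0.5 * conc + rng.normal(0, 0.01, 50), other])
r = stocsy(X, driver_idx=0)
```

In this toy case the two peaks that scale with the same concentration correlate strongly with the driver, whereas the unrelated variable does not.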
Statistical correlation can be undermined by overlapped signals unrelated to the metabolite of interest in a 1D-NMR spectrum, and 2D-NMR experiments are still required for unambiguous structure elucidation29. In addition, the structural information obtained using statistical algorithms is dependent on criteria such as correlation thresholds28,30 or correlation-distance cut-offs31. SubseT Optimization by Reference Matching (STORM)32 is a derivation of STOCSY that aims to separate out confounding spectra that do not match a supplied reference spectrum of a potential biomarker signal, thereby showing clearer spectral correlations between variables for both low and high intensity signals. The reference spectrum is a single spectrum that contains the signal of interest. The peak segment is correlated with the same region of all samples: a high correlation indicates that a sample contains the same signal and is likely ‘informative’, whereas samples with a low correlation do not have this signal and are uninformative. Subsets of spectra and variable correlations are found by carefully correcting for multiple testing in both phases and using statistical shrinkage. Here, we describe a holistic strategy for the identification of unknown metabolites that combines the strengths of statistical spectroscopy, NMR, multiple separation techniques and MS, and apply the strategy to identify three novel dietary biomarkers. In addition, we demonstrate an extension of STORM to 2D-NMR spectra to uncover the identity of unknown metabolites.
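The reference-matching step of STORM described above can be sketched as a single pass of subset selection. This is a deliberately simplified illustration: it uses a fixed correlation threshold (`min_corr`, a hypothetical parameter) instead of the multiple-testing correction and shrinkage used by the actual method, and it omits the iterative updating of the reference.

```python
import numpy as np

def select_subset(X, reference, region, min_corr=0.8):
    """Return indices of spectra whose segment matches the reference segment.

    X: (n_spectra, n_variables) array; reference: one spectrum containing
    the signal of interest; region: slice covering the peak segment.
    The fixed threshold is an assumption made for illustration only.
    """
    seg = np.asarray(X, dtype=float)[:, region]
    ref = np.asarray(reference, dtype=float)[region]
    seg_c = seg - seg.mean(axis=1, keepdims=True)   # center each segment
    ref_c = ref - ref.mean()
    denom = np.sqrt((seg_c ** 2).sum(axis=1) * (ref_c ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        r = np.where(denom > 0, seg_c @ ref_c / denom, 0.0)
    return np.flatnonzero(r >= min_corr), r

# Synthetic spectra: two contain the peak, one only a sloping baseline,
# one a peak at a different position.
grid = np.linspace(-1, 1, 41)
region = slice(10, 31)
peak = np.exp(-grid ** 2 / (2 * 0.1 ** 2))
slope = 0.05 * grid
shifted = np.exp(-(grid - 0.5) ** 2 / (2 * 0.1 ** 2))
X = np.vstack([peak + slope, 3 * peak + slope, slope, shifted])
subset, r = select_subset(X, reference=X[0], region=region)
```

Only the spectra that contain the reference signal pass the threshold and would be retained as the 'informative' subset for the subsequent correlation analysis.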
EXPERIMENTAL SECTION
Food challenges (FCs) – To discover urinary biomarkers of peas, apples and onions, three FCs were designed. A total of nine healthy participants (4 women, 5 men; non-smokers; age: 22–32y; BMI: 21.2–25.3 kg/m2) were recruited and assigned to one of the three FCs. Participants were provided with the assigned foods as part of a standardized dinner (incl. 125g of chicken breast as a protein source). Incremental amounts of the designated food were consumed over three consecutive days: 60/120/180g for (boiled) peas, 40/80/160g for (raw) apples and 20/40/60g for (fried) onions. For 24hr preceding the FC, and throughout the FC, participants were asked to consume their habitual diet and avoid consumption of coffee/tea/cocoa and any additional amounts of assigned foods. Cumulative urine samples were collected into sterilized single-use urine containers (International Scientific Supplies Ltd., Bradford, U.K.) from dinner up to and including the first morning void. Urine samples were stored at −80°C until analysis.
Controlled clinical trial (CCT) – 19 healthy participants (9 women, 10 men; non-smokers; age: 25–60y; BMI: 21.1–33.3 kg/m2) attended the NIHR/Wellcome Trust Imperial Clinical Research Facility for four 3-day inpatient periods, separated by a period of >4 days, with food and drink intake tightly controlled (alcohol/coffee/tea/cocoa were not provided). In random order, participants followed all four DIs representing 100% (Diet 1), 75% (Diet 2), 50% (Diet 3) and 25% (Diet 4) of WHO healthy eating guidelines1 with respect to carbohydrates, fats, fiber, fruits, salt, sugar and vegetables. Full details of the clinical trial design have been described previously33. Foods consumed relevant to the present study are tabulated in Supporting Information.
Each participant collected cumulative urine samples (CS) on each day of each DI from after breakfast to before lunch (CS1), after lunch to before dinner (CS2) and after dinner to before breakfast the following day (CS3). 24hr urine samples were obtained by pooling the cumulative samples. Aliquots of urine were transferred into Eppendorf tubes and stored at −80°C until analysis. All participants provided informed, written consent prior to the CCT (Registration No: ISRCTN-43087333), which was approved by the London Brent Research Ethics Committee (13/LO/0078). All studies were carried out in accordance with the Declaration of Helsinki.
1H-NMR analysis – Aliquots of 600µL of urine samples were centrifuged at 16,000xg at 4°C for 5min. All available samples (nFC=27, nCCT=906, for missing CCT data see Supporting Information) were prepared for 1H-NMR spectroscopy following the protocol described in34, mixing 540µL of supernatant with 60µL of pH7.4 phosphate buffer containing trimethylsilyl-[2,2,3,3,-2H4]-propionate as an internal reference standard (‘NMR-buffer’). Water-suppressed 1H-NMR spectroscopy was performed at 300K on a Bruker 600MHz spectrometer (Bruker Biospin, Karlsruhe, Germany) using a standard 1D pulse sequence (RD−gz,1−90°−t−90°−tm−gz,2−90°−ACQ) with saturation of the water resonance7. RD is the relaxation delay, t is a short delay (4µs), 90° represents a radio frequency (RF) pulse that tips the magnetization by 90°, tm is the mixing time (10ms), gz,1 and gz,2 are magnetic field z-gradients both applied for 1ms, and ACQ is the data acquisition period of 2.73s. 1H-NMR spectra were acquired using 4 dummy scans and 32 scans, 64K time domain points and with a spectral window of 20ppm. Prior to Fourier transformation, free induction decays were multiplied by an exponential function corresponding to a line broadening of 0.3Hz. 1H-NMR spectra were normalized to the total urine volume to correct for differences in dilution.
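The exponential multiplication applied before Fourier transformation can be sketched as follows. This is a minimal illustration on a synthetic single-resonance free induction decay, not the vendor processing used in the study; the function name, the synthetic signal, and the assumption that 64K time domain points correspond to 32,768 complex points are all hypothetical.

```python
import numpy as np

def apply_line_broadening(fid, dwell_time_s, lb_hz=0.3):
    """Multiply a complex FID by exp(-pi * lb * t) (exponential apodization)."""
    t = np.arange(fid.shape[-1]) * dwell_time_s
    return fid * np.exp(-np.pi * lb_hz * t)

# Synthetic FID (illustrative values only, not study data).
spectral_width_hz = 12000.0      # 20 ppm at 600 MHz
n_points = 32768                 # assumed number of complex points
dwell = 1.0 / spectral_width_hz  # acquisition time is about 2.73 s
t = np.arange(n_points) * dwell
fid = np.exp(2j * np.pi * 1000.0 * t) * np.exp(-t / 0.5)

apodized = apply_line_broadening(fid, dwell, lb_hz=0.3)
spectrum = np.fft.fftshift(np.fft.fft(apodized))
freqs_hz = np.fft.fftshift(np.fft.fftfreq(n_points, d=dwell))
peak_hz = freqs_hz[np.argmax(np.abs(spectrum))]
```

The window leaves the first point unchanged and attenuates the noisy tail of the decay, which trades a 0.3Hz increase in line width for an improved signal-to-noise ratio.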
1H–1H 2D J-resolved experiments7 were acquired using a pulse sequence to detect the J-couplings in the second dimension, with suppression of the water resonance (RD–90°–t1–180°–t1–ACQ), where t1 is an incremented time period, RD is 2s, 180° represents a 180° RF pulse and ACQ is 0.41s. J-resolved spectra were acquired using 16 dummy scans and 2 scans, 8K points with spectral window of 16.7ppm for f2 and 40 increments with spectral window of 78Hz for f1. Continuous wave irradiation was applied at the water resonance frequency using a 25Hz RF during the RD. A sine-bell apodization function was applied to f2 and a squared sine-bell to f1 of the J-resolved data, followed by Fourier transformation, tilting by 45° and symmetrization along f1 before data analysis.
A suite of 2D-NMR experiments including 1H–1H TOtal Correlation SpectroscopY (TOCSY), 1H–1H COrrelation SpectroscopY (COSY), 1H–13C Hetero-nuclear Single Quantum Coherence (HSQC) and 1H–13C Hetero-nuclear Multiple-Bond Correlation (HMBC) spectroscopy were used for identification purposes25,29.
SPE-NMR – Apple extracts were homogenized using a Kenwood KMix Blender for 5min. The puree obtained was filtered using a stainless steel filter and centrifuged for 10min at 16,000xg. 2mL of each sample (urine/apple) were lyophilized overnight. Freeze-dried (FD) urine/apple samples were dissolved in 1mL of 50mM sodium phosphate pH8.5 and briefly sonicated prior to being loaded onto a 100 mg/ml Bond Elut™ phenylboronic acid (PBA) SPE-cartridge (Agilent Technologies, Stockport, U.K.). The SPE-cartridge was conditioned with 1mL of 70:30v/v H2O:ACN, 0.1M HCl followed by 1mL of 50mM sodium phosphate pH10. After conditioning, the sample was loaded onto the SPE-cartridge. The loaded sample was washed with 2mL 50:50v/v ACN:sodium phosphate (10mM) pH8.5. The sample was eluted using 1mL of acidified solutions (0.1M HCl) of 100% H2O, 90:10v/v H2O:ACN and 70:30v/v H2O:ACN and subsequently lyophilized until dry. The dried eluent fractions were reconstituted in 540µL H2O and 60µL of NMR-buffer, vortexed and centrifuged prior to NMR analysis.
LC-NMR-MS27 – 5mL of urine collected overnight after consumption of onion was lyophilized overnight, reconstituted in 500µL of the original urine sample, vortexed, sonicated and centrifuged (20min at 16,000xg). The supernatant was repeatedly injected (7×2µL) onto a reversed-phase HPLC column (Waters Atlantis-T3, 3µm, 4.6×150mm at 30°C) in a Waters Acquity UPLC comprising a binary solvent manager and photodiode array detector with a Waters CTC auto-sampler with 100µL sampling needle, and eluted at 0.8mL/min using the following gradient: 0.0–60.0min (99.9:0.1% H2O/formic acid), 60.01–65.0min (99.9:0.1% methanol/formic acid), 65.01–127.5min (99.9:0.1% H2O/formic acid). The chromatographic separation of the sample was fractionated using a Waters Fraction Collector III. A total of 120 fractions were collected, one every 29s (starting at t=5min, finishing at t=63min), and dried under a stream of nitrogen. Each fraction was re-dissolved in 540µL of H2O and 60µL of NMR-buffer and analyzed by 1H-NMR. A volume of 50µL of the fraction containing the unknown metabolite was analyzed by reversed-phase LC-MS (Waters Acquity Ultra Performance LC system coupled to a Xevo G2 Q-TOF mass spectrometer; Waters, Milford, MA, U.S.A.), following an established metabolic profiling method35. The optimized capillary voltage, cone voltage and collision energy were 3kV/20V/4V for ESI+ and 1.5kV/30V/6V for ESI-, using a source temperature of 120°C and desolvation temperature of 600°C. The desolvation gas flow was 1000L/h for both modes and the cone gas flows were 50L/h (ESI+) and 150L/h (ESI-). Leucine enkephalin was used as the reference lock mass at 556.2771 ([M+H]+) and 554.2615 ([M-H]-).
RED-STORM algorithm – Here, STORM was extended in a probabilistic framework and modified for applicability to 2D data to provide a clearer signature of structural correlations in the data. The extended algorithm was named Resolution EnhanceD SubseT Optimization by Reference Matching (RED-STORM).
High correlations in the data are of interest between samples and a reference spectrum of a signal of interest (for subset optimization) and between a driver and all other variables (for assessing variable importance). However, high correlations are not normally distributed because the correlation is bounded ([-1,1]); this results in their distributions being negatively skewed. Therefore, the correlation ($\rho$) is transformed to a Fisher z-score, which results in approximately normally distributed data that can be analyzed using parametric methods:
1. $z = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)$
Next, the distribution of all z-scores is modelled using a Gaussian Mixture Model (GMM). A GMM is a weighted sum of k Gaussian clusters and is defined as
2. $p(x \mid \mu, \sigma, \pi) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(\mu_j, \sigma_j^2)$
where $x$ has $n$ data points, $\mu_j$ are the means, $\sigma_j^2$ the variances, $\pi_j$ the mixture weights for each of $k$ Gaussians ($\sum_{j=1}^{k} \pi_j = 1$) and $\mathcal{N}(\mu_j, \sigma_j^2)$ a normalized Gaussian distribution with specified $\mu_j$ and $\sigma_j^2$. In order to obtain clusters of variables from the Fisher z-transformed correlations a parameter-free GMM is used36. The model automatically learns the optimal number of clusters from the data. For completeness, a brief description of the method follows; for mathematical proofs see36.
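Before the priors are described, the Fisher z-transformation and the mixture density can be illustrated numerically. This is a minimal sketch assuming NumPy; the function names are hypothetical, the clipping constant is an assumption added to keep correlations of exactly ±1 finite, and the mixture here has fixed, user-supplied parameters rather than the parameter-free model used in the paper.

```python
import numpy as np

def fisher_z(rho, eps=1e-12):
    """Fisher z-score, 0.5 * ln((1 + rho) / (1 - rho)), i.e. arctanh(rho)."""
    rho = np.clip(np.asarray(rho, dtype=float), -1 + eps, 1 - eps)
    return np.arctanh(rho)

def gmm_density(z, weights, means, variances):
    """Density of a Gaussian mixture with given weights, means and variances."""
    z = np.atleast_1d(np.asarray(z, dtype=float))[:, None]
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    comp = np.exp(-(z - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
    return comp @ weights    # weighted sum over the k components

z_half = fisher_z(0.5)
density_at_zero = gmm_density(0.0, weights=[0.5, 0.5], means=[0.0, 0.0], variances=[1.0, 1.0])
```

The transformation stretches correlations near ±1, so that clusters of high correlations can be modelled with symmetric Gaussian components.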
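Each cluster mean ($\mu_j$) and precision ($\sigma_j^{-2}$) from equation 2 are given Gaussian ($\mu_j \mid \lambda, r \sim \mathcal{N}(\lambda, r^{-1})$) and Gamma ($\sigma_j^{-2} \mid \beta, w \sim \Gamma(\beta, w^{-1})$) priors, respectively. Here, $\lambda$ and $r$ are hyper-parameters for the means of all clusters. Similarly, $\beta$ (shape) and $w$ (scale) are hyper-parameters for the cluster precisions. The hyper-parameters themselves are initialized from the observation mean ($\mu_x$) and variance ($\sigma_x^2$) as vague priors ($\lambda \sim \mathcal{N}(\mu_x, \sigma_x^2)$; $r \sim \Gamma(1, \sigma_x^{-2})$; $\beta^{-1} \sim \Gamma(1, 1)$; $w \sim \Gamma(1, \sigma_x^2)$). The posterior distributions of the cluster means are obtained from
3. $\mu_j \mid \tilde{x}_j, n_j, \sigma_j, \lambda, r \sim \mathcal{N}\left(\frac{\tilde{x}_j \sigma_j^{-2} + \lambda r}{n_j \sigma_j^{-2} + r}, \frac{1}{n_j \sigma_j^{-2} + r}\right)$ with $\tilde{x}_j = \sum_{i=1}^{n} \delta_{c_i, j}\, x_i$ and $n_j = \sum_{i=1}^{n} \delta_{c_i, j}$,
where $\tilde{x}_j$ is the sum of observations from cluster $j$, $n_j$ the number of observations with highest probability for cluster $j$, $c_i$ the cluster with highest probability for observation $i$ and $\delta_{c_i, j}$ a Kronecker delta. Similarly, the posterior distributions of the cluster precisions are obtained from
4. $\sigma_j^{-2} \mid n_j, \beta, w \sim \Gamma\left(\beta + n_j, \left[\frac{w\beta + B_j}{\beta + n_j}\right]^{-1}\right)$ with $B_j = \sum_{i=1}^{n} \delta_{c_i, j}\,(x_i - \mu_j)^2$.
The posteriors for $\lambda$ ($\lambda \mid \mu, r \sim \mathcal{N}\left(\frac{\mu_x \sigma_x^{-2} + r \sum_{j=1}^{k} \mu_j}{\sigma_x^{-2} + k r}, \frac{1}{\sigma_x^{-2} + k r}\right)$), $r$ ($r \mid \mu, \lambda \sim \Gamma\left(k + 1, \left[\frac{\sigma_x^{2} + \sum_{j=1}^{k} (\mu_j - \lambda)^2}{k + 1}\right]^{-1}\right)$) and $w$ ($w \mid \sigma, \beta \sim \Gamma\left(k\beta + 1, \left[\frac{\sigma_x^{-2} + \beta \sum_{j=1}^{k} \sigma_j^{-2}}{k\beta + 1}\right]^{-1}\right)$) are obtained similarly to the priors, whereas the posterior for $\beta$ takes a different form and makes use of the fact that $p(\log \beta \mid \sigma, w)$ is log-concave36:
$p(\beta \mid \sigma, w) \propto \Gamma\left(\frac{\beta}{2}\right)^{-k} \exp\left(-\frac{1}{2\beta}\right) \left(\frac{\beta}{2}\right)^{(k\beta - 3)/2} \prod_{j=1}^{k} \left(\sigma_j^{-2} w\right)^{\beta/2} \exp\left(-\frac{\beta \sigma_j^{-2} w}{2}\right)$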
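The update of a cluster mean can be sketched as one draw from its conditional posterior. The sketch assumes the standard conjugate-normal form, in which the posterior precision is the number of cluster members times the cluster precision plus the prior precision; the function name and argument names (`prior_mean`, `prior_precision`) are hypothetical, and the block covers only this single step of the sampler, not the full Markov-Chain Monte-Carlo scheme.

```python
import numpy as np

def sample_cluster_mean(x, labels, j, cluster_precision, prior_mean, prior_precision, rng):
    """Draw the mean of cluster j from its conjugate Gaussian posterior.

    x: 1D array of observations (e.g. Fisher z-scores); labels: cluster
    assignment of each observation. Names are illustrative assumptions.
    """
    members = x[labels == j]
    n_j = members.size                       # observations assigned to cluster j
    x_sum = members.sum()                    # sum of those observations
    post_precision = n_j * cluster_precision + prior_precision
    post_mean = (x_sum * cluster_precision + prior_mean * prior_precision) / post_precision
    draw = rng.normal(post_mean, np.sqrt(1.0 / post_precision))
    return draw, post_mean, post_precision

rng = np.random.default_rng(1)
x = np.array([1.0, 2.0, 3.0, 10.0])
labels = np.array([0, 0, 0, 1])
draw, post_mean, post_precision = sample_cluster_mean(
    x, labels, j=0, cluster_precision=1.0, prior_mean=0.0, prior_precision=1.0, rng=rng
)
```

With three members summing to 6, unit cluster precision and a zero-mean, unit-precision prior, the posterior mean is shrunk from the sample mean of 2 towards the prior mean, which shows the regularizing effect of the hyper-parameters on small clusters.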