Untargeted Metabolome Quantitative Trait Locus ... - ACS Publications

For each spectral peak, six summary statistics were calculated and independently tested for evidence of genetic linkage in a cohort of F2 (129S6xBALB/...
12 downloads 0 Views 4MB Size
ARTICLE pubs.acs.org/jpr

Untargeted Metabolome Quantitative Trait Locus Mapping Associates Variation in Urine Glycerate to Mutant Glycerate Kinase Jean-Baptise Cazier,† Pamela J. Kaisaki,† Karene Argoud,† Benjamin J. Blaise,‡ Kirill Veselkov,§ Timothy M. D. Ebbels,§ Tsz Tsang,§ Yulan Wang,# Marie-Therese Bihoreau,† Steve C. Mitchell,§,^ Elaine C. Holmes,§ John C. Lindon,§ James Scott,|| Jeremy K. Nicholson,§ Marc-Emmanuel Dumas,‡,§,# and Dominique Gauguier*,†,& †

Wellcome Trust Centre for Human Genetics, Roosevelt Drive, University of Oxford, Oxford OX3 7BN, U.K. Universite de Lyon, FRE 3008 CNRS/ENS-Lyon/UCBL1 Centre de RMNa Tres Hauts Champs, 5 rue de la Doua, 69100 Villeurbanne, France § Biomolecular Medicine, Department of Surgery and Cancer, Imperial College London, South Kensington, London SW7 2AZ, U.K. # State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, Wuhan Centre for Magnetic Resonance, Wuhan Institute of Physics and Mathematics, The Chinese Academy of Sciences, PR China ^ Metabometrix Ltd, Imperial Incubator, Bessemer Building, Prince Consort Road, London SW7 2BP, U.K. Department of Vascular Science, National Heart and Lung Institute, Imperial College London, South Kensington, London SW7 2AZ, U.K. & Centre de Recherche des Cordeliers, INSERM UMRS872, 15 rue de l’Ecole de Medecine, 75006 Paris, France

)



bS Supporting Information ABSTRACT: With successes of genome-wide association studies, molecular phenotyping systems are developed to identify genetically determined disease-associated biomarkers. Genetic studies of the human metabolome are emerging but exclusively apply targeted approaches, which restricts the analysis to a limited number of wellknown metabolites. We have developed novel technical and statistical methods for systematic and automated quantification of untargeted NMR spectral data designed to perform robust and accurate quantitative trait locus (QTL) mapping of known and previously unreported molecular compounds of the metabolome. For each spectral peak, six summary statistics were calculated and independently tested for evidence of genetic linkage in a cohort of F2 (129S6xBALB/c) mice. The most significant evidence of linkages were obtained with NMR signals characterizing the glycerate (LOD10-42) at the mutant glycerate kinase locus, which demonstrate the power of metabolomics in quantitative genetics to identify the biological function of genetic variants. These results provide new insights into the resolution of the complex nature of metabolic regulations and novel analytical techniques that maximize the full utilization of metabolomic spectra in human genetics to discover mappable disease-associated biomarkers. KEYWORDS: genetics, metabonomics, functional genomics, quantitative trait locus, NMR spectroscopy, glycerate, glycerate kinase

’ INTRODUCTION While the era of human genome-wide association studies (GWAS) will reach its natural limitations,1 progress in functional genomic technologies and investigations into the genetic determinants of gene expression keep developing from crossdisciplinary expertise to uncover disease-predictive and diseaseassociated biomarkers.2 Metabonomics3 is a systems biology approach based on the quantitative analysis of multivariate metabolic components in mutlicellular systems. It provides information simultaneously on a range of small molecules in a biological sample4 that can represent biomarkers underlying distal regulatory mechanisms of genome expression and is a promising approach in diabetes r 2011 American Chemical Society

and cardiovascular disease research.5,6 Mapping the genetic determinants of such direct biological end points and their various homeostatic functions can provide powerful insights into the mechanisms involved in disease etiology.7,8 Metabolomic profiling of biofluids and biopsy extracts in populations generates qualitative and quantitative molecular phenotyping snapshots at a range of time points and under diverse environmental conditions to provide a multidimensional insight into genetically mappable mechanisms through quantitative trait locus (QTL) analysis. Pioneering studies in plants9,10 Received: June 11, 2011 Published: October 26, 2011 631

dx.doi.org/10.1021/pr200566t | J. Proteome Res. 2012, 11, 631–642

Journal of Proteome Research

ARTICLE

have shown the possibility to link a specific metabolite to a genetic locus using 1H NMR spectroscopy intensity at successive resonances as quantitative phenotype variables. We have demonstrated in a metabolomic QTL (mQTL) study in rodents causal relationships between genetic polymorphism and variations in abundance of plasma metabolite and gene transcript.11 However, technical aspects such as sample quality, which varies across a study, spectral peak overlap and noise level complicate NMR spectral data quantification. Metabolomic profiles also fluctuate as the consequence of uncontrolled environmental changes and intrinsically heterogeneous genetic backgrounds in human populations and experimental cohorts. For these reasons, genetic studies of the human metabolome currently apply targeted approaches that restrict molecular profiling to 130160 wellknown metabolites,8,12 leaving unaddressed the genetic control of a large fraction of molecular compounds present in a biological sample. There is therefore a crucial need to develop new methods for robust and accurate mapping of untargeted mQTLs. The main limiter against robustness currently lies in the handling of various perturbations and their consequences on the selection of spectral peaks. Coupled with the very large number of resonances tested, this can rapidly lead to false positive QTLs, hence the requirement of corrections for multiple testing. In a preliminary study, we have addressed these confounding issues by systematic averaging NMR signals across a running window,13 which prevents bias when manual selection of peaks is applied but leads to an overrepresentation of quantitative metabolomic traits that have to be genetically mapped. To help resolve these issues, the development of automated peak identification tools can increase throughput and reduces dimensionality, thus increasing the number of individuals studied, hence the power, and can be applied in parallel to several modalities such as different tissues or time points. In this report, we have developed technical and statistical methods for systematic and automated quantification of various summary statistics derived from untargeted metabolomic spectral data ultimately designed to perform mQTL experiments. Results suggest causal relationships between mQTLs and genomic variants by the mapping of significant genetic linkages to NMR signals characterizing the glycerate (LOD10-42) at the glycerate kinase locus, which contains a frameshift mutation causing a premature stop codon in the gene encoding the enzyme. These data in a mouse cross provide novel insights into the resolution of the complex nature of metabolomic traits when applied in genetics and pave the way to the application of systematic untargeted metabolomics in complex trait genetics in humans.

guidelines on animal welfare and license conditions and University of Oxford guidelines on animal welfare Genotype Analysis and DNA Sequencing

DNA was prepared from spleen of F2 mice and inbred strains (129S6, C57BL/6, BALB/c) using the DNeasy blood and tissue kit (Qiagen Ltd., Crawley, UK). A panel of 1538 single nucleotide polymorphism (SNP) markers was used to generate genotype data using the Illumina GoldenGate assay protocol, allowing high-multiplex genotyping, in an Illumina Beadlab (Illumina, Inc. San Diego, CA), according to the manufacturer recommendations. Of those, 404 makers (26%) showing evidence of allele variation between 129S6 and BALB/c were used to construct a genetic map in the cross using the JoinMap suite of progams.14 For DNA sequencing PCR products were purified with the QIAquick PCR purification kit (Qiagen Ltd., Crawley, UK) and reactions were carried out (Big Dye), followed by ethanol precipitation. The dried pellets were sequenced by capillary sequencing. 1

H NMR Spectroscopy

Mouse urine samples were prepared using 200 μL of urine mixed with 200 μL of water and 200 μL of 0.1 M phosphate buffer (pH 7.4) solution (10% 2H2O/H2O v/v, with 0.05% sodium 3-trimethylsilyl-(2,2,3,3-2H4)-1-propionate for chemical shift reference at δ0.0). Standard 1H NMR spectra were measured on a Bruker Avance-600 (Bruker, Rheinstetten, Germany) operating at 600.22 MHz 1H frequency, as described previously.15 A standard 1D pulse sequence (recycle delay-90t1-90-tm-90-acquisition) was used. Efficient water peak suppression was achieved by irradiating the water peak during the recycle delay (2 s) and mixing time, tm, of 150 ms. The 90 pulses were calibrated to 10 μs and t1 was set to 3 μs. The acquisition consisted of 128 transient increments of 32K data points. Fourier transform with 0.3 Hz line broadening and zerofilling led to 1H NMR spectra with 64K data points. The spectra were phase- and baseline-corrected using in-house software and were imported into Matlab and R at high resolution as described previously.16 The regions δ6.05.5 and δ5.04.5 were removed to eliminate baseline effects of imperfect water saturation. NMR structural assignment was achieved using public databases (Human Metabolome Database, HMDB, www.hmdb.ca and COLMAR metabolomics, spinportal.magnet.fsu.edu) and in-house databases, and included measurement of spectra of authentic standards. Peak Alignment 1

H NMR spectroscopic signals from identical chemical groups can exhibit an uncontrolled variation in peak positions across multiple spectra for a variety of sample composition reasons such as differences in pH, ionic strength, metal ion concentration, and osmolality. Correcting for this unwanted peak position variation is necessary for subsequent statistical analysis. Recursive segment-wise peak alignment (RSPA) has been recently introduced and evaluated to possess the comparative advantages to other peak alignment approaches over a wide range of peak intensities17 and thus was employed here prior to statistical recoupling of resonances (Figure 1a). This transformation was implemented in Matlab. RSPA aims at realigning peak positions of a 1H NMR spectrum with respect to peak positions of a reference spectrum. The choice of a reference has been shown not to be crucial for the method performance; it can be any sample of a data set or, as typically selected, the

’ METHODS Animals

A large cross of F2 male hybrids (n = 187) was derived from mice of the BALB/cOxjr (BALB/c) and 129S6/SvEvOxjr (129S6) strains bred locally using a stock originating from the Jackson Laboratory. Mice were weaned at 3 weeks and housed in groups of 10  12 animals. They were fed ad libitum a standard carbohydrate and 5% low fat diet (LFD) (B&K Universal Ltd., Hull, UK) from weaning to 5 weeks of age, when they were fed a 40% high fat diet (HFD) (Special Diets Services, UK), containing 32% lard and 8% corn oil. Mice were maintained under a 12 h-12 h light-dark regime with lights off at 7 pm. Animal procedures were carried out in accordance with UK Home Office 632

dx.doi.org/10.1021/pr200566t |J. Proteome Res. 2012, 11, 631–642

Journal of Proteome Research

ARTICLE

Figure 1. Outlined strategy applied to metabolomic data quantification prior to QTL mapping. Methods were optimized for spectral data alignment by recursive segment-wise peak alignment (a), quantification of NMR peaks using statistical recoupling of variable based on various summary statistics (b), metabolite identification (c). Each arrow corresponds to a peak as identified by the corresponding statistics (c). The black line in (c) shows the consensus peaks drawn as a binary addition: a score of 63 corresponding to consensus across all the measures, a score of 15 to all but trapezium and rectangle as mean = 1, median = 2, sum = 4, max = 8, trapezium = 16, rectangle = 32. Statistical significance (d) at 5% FDR derived from simulation analysis and assessed through 500 simulations under the null hypothesis of no linkage with the summary statistics calculated and the running method. For each combination of alignment and summary measure, the LOD score corresponding to 5% FDR is given. The top row in (d) shows the distribution of score for various summary measures, without any prior alignment, whereas the bottom row includes a preprocessing alignment step by RSPA.

median profile.17 Corrections for complex peak shifts in a sample (“test”) spectrum in RSPA are simplified into subcorrections for peak shifts in its segments relative to peak positions of corresponding reference segments. Segments are defined to contain multiple peaks such that any multiplet is most likely to be represented in a single segment. The peak shifts in each test segment are recursively corrected to match peak positions of its

corresponding reference segment in a top down fashion, subdividing the initial larger segment into smaller ones, as required, to improve peak alignment. The recursion starts by shifting peaks in a test segment as a whole and then progresses to smaller subsegments until no further peak alignment is required. Segment alignment toward the reference spectrum is achieved by linear shifting and interpolation at segment 633

dx.doi.org/10.1021/pr200566t |J. Proteome Res. 2012, 11, 631–642

Journal of Proteome Research

ARTICLE

boundaries. The optimal shift position is found at the maximum of the cross correlation function: corðr, sÞd ¼

N

∑ rðiÞsði þ dÞ i¼1

dopt ¼ argmax ðcorðr, sÞd Þ

pattern. Then consecutives values were collapsed together if they were correlated across the samples using statistical recoupling of variables (SRV).18 SRV identifies relationships of covariance and correlation between consecutive variables along the chemical shift axis to cluster together (i.e., to recouple) variables belonging to the same physical, chemical, or biological entity. The SRV algorithm uses a measure of a local spectral dependency based on the ratio between covariance and correlation of two consecutive variables. In NMR spectroscopy, a peak is typically defined by several contiguous NMR variables, which means there is a local dependency within the NMR spectrum: knowledge about the variation of variable ppm implies knowledge about the variation of variable ppm +1.18 Typically, a well-resolved NMR peak from a biological molecule has a peak base resolution of 510 Hz, corresponding to about 10 consecutive NMR variables on the spectrum acquired at 600 MHz and imported at a resolution of 103 ppm. This measure will be used to derive the spectral dependency landscape (i.e., the profile of spectral dependency). For two consecutive variables ppm and ppm+1 along the NMR chemical shift axis, the spectral dependency SpD (Li) of the data set is defined as follows:

ð1Þ ð2Þ

d

where d is a shift between reference (r) and test (s) segments and N is a segment length. The calculation of the cross correlation function is accelerated by the Fast Fourier transform, which reduces the computational complexity from order of N2 to N log N. FT

s f RðjÞS/ðjÞ corðr, sÞd r

ð3Þ

where R and S are the Fourier transform of functions of r and s, respectively, and * denotes a complex conjugation. The cross correlation function is calculated by applying a fast Fourier transform to both r and s functions, multiplying the transformed functions, R and S*, and finally performing the inverse Fourier transform of this product. After shifting, the test segment is linearly interpolated at its boundaries. In the case of a segment containing a single multiplet, it will be aligned as a whole, requiring no further alignment and thus stopping recursion. When there are additional peaks in a segment, local recursive alignment will progress to smaller subsegments and align those peaks exhibiting positional variation. Finally, a test spectrum is reconstructed by joining aligned segments together. The method is computationally fast, requiring only minutes to align a data set of hundreds of NMR spectra of complex mixtures with 64K point resolution on a standard personal computer. This approach is a key to our untargeted strategy to test blindly all possible metabolites from the NMR profile, rather than focus on identified ones.

covariance ði, i þ 1Þ correlation pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ varianceðiÞ:varianceði þ 1Þ sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 N 1 N ¼ ði  ̅ i Þ2 ðði þ 1Þ  ði þ 1ÞÞ2 N i¼1 N i¼1

LðiÞ ¼





ð4Þ

where i = ppm. The algorithm scans the SpD landscape to identify clusters of consecutive variables with high dependency, by identifying local minima. A cluster is considered as corresponding to a physical NMR signal if the cluster has a line-width of 0.7 Hz, which corresponds to 10 consecutive NMR variables (from variable ppm to variable ppm+10 along the horizontal ppm axis) acquired at 700 MHz and sampled at a 103 ppm resolution. This step typically identifies peak boundaries. The algorithm then tests whether consecutive clusters are correlated to identify potential neighboring peaks belonging to a multiplet. If two consecutive clusters (peaks) are correlated, they are considered as being part of a multiplet and are therefore aggregated into a single cluster. We used here a correlation threshold of 0.9 for multiplet detection. SRV clusters were then summarized as described in the next section. We used here a correlation threshold of 0.9 to separate successive peaks with the risk of testing the same metabolite multiple times. All onward computational analyses were implemented in R,19 an open source system for statistical computation and graphics.

Dimensionality Reduction and Statistical Recoupling

Our NMR experiments were acquired using 32K contiguous data points in the time domain (as described above). To reveal the resonances in the frequency domain, a Fourier transform is performed, typically resulting in 1H NMR spectra with 64K contiguous data points, or “raw” variables. Noticeably, each NMR variable is not necessarily informative as it corresponds to a given frequency. Some NMR spectral regions do not correspond to metabolic information, and no signal is detected at such frequencies region apart from pure electronic noise from the electronics of the instrument itself. On the other hand, the region between 1 and 10 ppm, corresponds to resonances from 100 to 1000s of biological molecules. However, only about 50100 individual compounds can be reliably assigned to a single NMR spectrum, that is, 2 orders of magnitude less than the number of inputs (NMR variables). As a consequence, there is a strong multi-colinearity in the NMR data set, which directly impacts the number of univariate statistical tests involved in QTL mapping. The number of tests to be performed must therefore be reduced by being more specific, thus requiring spectral peak alignment (Figure 1a). To our advantage, those plentiful measures are not random and follow a locally continuous pattern corresponding to specific metabolites, or their absence, at a given NMR resonance position. We therefore aim to reduce the number of traits to test by focusing on the resonances corresponding to metabolites. First we identified 1H NMR metabolic signals, defined by consecutive variables following a structured

Summary Statistics

We extended the original SRV method to apply a variety of measures to summarize each NMR peak and assess whether to collapse consecutive values together. We further improved the original version of SRV based on the mean value across a peak, to median, sum (area), and maximum, as well as variations of the area, subtracting a rectangle below the minimum value, or a trapezium below the extreme values to remove potential baseline noise (Figure 1b). Although these summary statistics are strongly correlated, they can provide some extreme changes in the associated signal due to their sensitivity and specificity at a given 634

dx.doi.org/10.1021/pr200566t |J. Proteome Res. 2012, 11, 631–642

Journal of Proteome Research

ARTICLE

Table 1. Chromosomal Mapping of 1H NMR Metabolomic Variables in F2 (129S6xBALB/c) Hybrid Micea Loc res ppm

candidate metabolite

Chr:Pos

0.941

alpha-ketoisocaproate/isoleucine

2:50.0

1.258

adipate alpha-ketoadipate 3-methylglutarate beta-hydroxybutyrate

3:94.6

LOD sum 8.1

b

max

trap.

rect.

4.0

10.1

7.0

6.3

11.6

9.5b

6.3

18.0b 6.2

b

10.3

median b

6.6

mean

run.

Al. run

b

8.4

N-acetylglutamate 1.542

short chain fatty acids: sebacate, adipate, butyrate

13:5.0

1.569

butyrate

13:5.0

1.878

unknown

7:22.7

4.3

1.988

unknown

6:39.7

6.9

5.6

2.167

butyrate folate homoserine methionine formiminoglutamate gamma-carboxyglutamate sebacate

2:61.7

15.8b

9.5b

2.419

beta-hydroxybutyrate pyroglutamate succinate beta-ketoadipate

6.3

6.7

10.2b

5.8

5.9

13:0.0

pantothenate 13:5.0

6.1

14.5b

9.8b

alpha-ketoglutarate

13:5.0

6.5

11.5

11.5b

2.479

unknown

13:0.0

5.5

6.4

3.134 3.708

unknown unknown

9:60.0 9:60.0

6.1 8.3b

5.9 11.2

10.0b

3.721

glycerate

9:61.7

6.1

7.3b

15.0b

20.1b

11.1

9.4b

9.6

9.0b

5.1 22.6b

18.7b

2.428

beta-hydroxybutyrate pyroglutamate succinate beta-ketoadipate

2.452

pantothenate

3.725

glycerate

3.732

glycerate

b

9.60.0

11.3

9:60.0

3.741

unknown

9:60.0

3.822

glycerate

9:60.0

3.827

glycerate

9:60.0

3.841 4.086

unknown glycerate

9:60.0 9:45.0

8.9

b

7.7

b

b

12.4

b

10.5

b

6.5

b

7.0

b

8.4

b

7.3

39.2b

9:60.0

27.6b

9:65.0 4.092

glycerate

9:60.0

4.095

glycerate

9:60.0

4.097

glycerate

b

35.5

b

b

40.2

42.3

25.2b b

9:60.0

31.7

9:65.0 4.108 4.113

unknown unknown

9:60.0 9:60.0

4.240

phosphoserine guanidinosuccinate guanosine indole-3-pyruvate

9:55.0

5.350

unknown

8.8

b

b

13.0

b

9.8

21.3b

28.2b

7.3

11.6b

b

10.1

29.3b 10.1b

o-succinylhomoserine uridine 9:61.7

6.0

a

Results are shown for all metabolites showing evidence of linkage (LOD > 6) for any summary statistics with corresponding resonance, genetic location, and LOD score. The best LOD score per locus is reported in bold. b Indicates that the LOD score is significant at 95% for this measure. Details of 1H NMR resonances for all summary statistics are shown in Supplementary Table 1, Supporting Information.

obtained with other measures. We kept as well the original version described13 of a running window through the NMR profile, although this approach does not reduce dimensionality. This method was used both with the original set (“running”) or the aligned set (“aligned running”). Because of the measurespecific definition of the peaks, the comparison of metabolite by their optimal peak can only be approximate (Table 1).

resonance (Figure 1c) and overall (Figure 1d). The choice of measure not only defines the trait value associated with a peak, but the peak definition itself, as successive peaks’ values are tested for correlation before an eventual merger. As a consequence, for each measure, the number of peaks identified, and therefore tested, depends on the summary statistics used. From the distribution of the LOD score under the null hypothesis of no linkage, the best sensitivity is obtained with the rectangle measure (see Results), which corresponds best to the quantification of a metabolite as defined by the area under the NMR peak, yet correcting for a baseline introduced by a neighboring peak from another metabolite. For simplicity, we report only data obtained with the rectangle measure unless specific results were

Quantitative Trait Locus Mapping

We used the R/QTL package, an open software for mapping quantitative trait loci in experimental implemented as an R package by Broman et al.20 to perform the QTL analysis. The same genotypes described above were used throughout the 635

dx.doi.org/10.1021/pr200566t |J. Proteome Res. 2012, 11, 631–642

Journal of Proteome Research

ARTICLE

package.20 However, this does not take into account the multiple-trait testing which can be very significant if all NMR measurements were to be used. The level of metabolomegenome-wide significance was evaluated by permutations of both the genotypes and phenotypes, in order to test how many times a better top score was obtained by chance. The number of tests performed depends on the adjusted significance level required to obtain two meaningful digits; for example, to get a FDR of 5%, the observed LOD score would have to be ranked fifth from the 100 best LOD scores generated by simulations under the null. Therefore, to provide a more general view of significance level obtained with our approach, we have simulated 500 cohorts of 168 animals under the null hypothesis. We used again MetAssimulo to generate realistic NMR profiles, but for this mQTL analysis, traits in the profile follow a single random distribution, unlinked to the genotype. The best LOD scores for the each 500 metabonome-genome association were collected to define a distribution of best scores under the null (Figure 1d). Not only does this provide the expected 5% FDR level of LOD scores, but allowed the assignment of a significance p-value for each observed top LOD score.

experiment, but the trait used for the mapping depended on the type of analysis. To evaluate the LOD score, we used the extended Haley-Knott method to optimize the scoring21 in order to avoid problems with missing genotypes and ensure sufficient speed. Simulations

Two types of simulations were performed to evaluate the distribution of top LOD score in a random set of animals, and to compare the expected experimental result to the evidence by generation of a model cohort. In both cases, we used the MetAssimulo package22 to generate realistic 1H NMR profiles. MetAssimulo is a 1H NMR urinary metabolic profile simulator, generating a 1H NMR profile based on a set of metabolite concentrations. It uses a spectral library of pure individual compounds (one spectrum per compound, known concentration), which are multiplied by the desired metabolic concentration matrix, leading to an entirely simulated data set. Using a set of simulated concentration matrices, we generated several 1H NMR simulated data sets for validation purposes. MetAssimulo combines experimentally acquired standard 1H spectra of known metabolites, yk(δ), at desired concentrations ck to simulate the NMR spectrum of the mixture of K metabolites using the following equation: yðδÞ ¼

K

∑ ykðδÞckpk þ εðδÞ k¼1

ð5Þ

’ RESULTS Experimental Set up and Properties of 1H NMR Metabolic Profiles

For all 168 (129S6xBALB/c) mice of the F2 cross, urine 1H NMR spectra were acquired at a temperature of 300 K at a spectrometer frequency of 600 MHz to characterize low-molecular weight metabolites such as amino acids, small organic acids, and energy metabolites. The properties of spectral metabolic profiles are manifold. Each metabolite gives rise to a 1H NMR spectrum where each chemically distinct hydrogen produces a peak (chemical shift in Hz or ppm) at a specific position on the frequency axis. Multiplet structure complicates the peaks due to magnetic interactions between the hydrogens transmitted via the chemical bonds (J-coupling). The NMR spectrum of a mixture can be regarded as a superposition of the spectra of the components to a good approximation. The intensity of a peak is proportional to the number of protons contributing, and hence absolute concentrations can be determined when an internal standard of known structure and purity is added. Typical 1H NMR spectra comprise 65 536 (64K) data points and describe the intensity of signals for each resonance frequency. Spectra were imported at high resolution as previously described,16 and the process resulted in importing about 18K data points for a given 10 ppm interval (at resolution of 5.5  104 ppm). Because of imperfections in the water presaturation, the region between 4.5 ppm and 6 ppm was discarded from further analysis and removed from the data set, which resulted in the acquisition of a total 13 457 NMR variables between [0.44.2489] and [6.459.9997] ppm. Since each pure metabolite generates a characteristic spectrum, this enables peak annotation and hence metabolite identification. However, structural similarity between metabolites can result in spectral peak overlap. Also, despite preparation of urine samples in a pH 7.4 phosphate buffer solution, slight variations in pH, ionic strength, or molecular interactions between metabolites such as chelation result in small but significant variability in chemical shifts, termed “positional noise”.16

where pk denotes the number of NMR observable protons in metabolite k and ε(δ) is a normally distributed error at chemical shift δ. The software allows the user to specify means and standard deviations for each metabolite in simulated “case” and “control” groups and provides a template using literature values for the 48 most abundant NMR observable metabolites in normal human urine. It is also possible to incorporate correlations between metabolites, as well as small changes in the chemical shift of individual resonances due to differences in pH or ionic strength between samples, although these features were not used in the glycerate simulations. The first set of simulations aimed at reproducing the best score observed with our data set by modeling the effect of the genotype on the trait. We generated two groups with a mean intensity at the glycerate metabolite scaled to 4.65 (sd = 1.60) and 11.23 (sd = 3.60). A background of the most abundant NMR observable metabolites in human urine was included. NMR profiles from group 1 or 2 were associated with each of the 168 animals according to its genotype at the top marker. We then performed the same analysis as for the real data on this set of simulated profiles to determine if the association with glycerate was detectable with a similar strength. The second group of simulations was performed under the null hypothesis of no association between NMR profiles and genetic location. A set of NMR profiles was generated using the default MetAssimulo parameters for human urine for 500 cohorts of 168 animals. These profiles were randomly associated with existing animals to form 500 independent cohorts. Our method was applied for each of these cohorts, and the best overall LOD score and associated metabolite/genetic location were kept to define the distribution. False Discovery Rate

Statistical significance of the results was assessed at both the genome-wide and metabolome-genome-wide levels. Genomewide false discovery rate (FDR) of a LOD for a given trait was calculated using the built-in permutation of the R/QTL

Untargeted Dimension Reduction by Peak Alignment and Recognition

To efficiently reduce both the dimensionality of the data set and the impact of multiple testing corrections that were 636

dx.doi.org/10.1021/pr200566t |J. Proteome Res. 2012, 11, 631–642

Journal of Proteome Research

ARTICLE

Figure 2. Results from QTL mapping across the whole urine 1H NMR metabolomic profile in F2 (129S6xBALB/c) hybrid mice using the rectangle summary statistics. Details are given for the full profiles of all LOD scores at each genetic locus and resonance in 3D (a), LOD profile across all 1H NMR resonances for the genetic locus C9loc65 showing the strongest evidence of statistically significant linkage (b), LOD profile across the whole genome for the resonance 4.097 ppm (c) and distribution of the LOD score for the whole analysis (d).

important caveats in our previous mQTL analyses,11 we developed a two-step strategy by addressing peak positional noise first and then resolving signal redundancy. RSPA and SRV allowed removal of chemical shift variation17 (Figure 1a), scanning of the aligned spectroscopic data set18 to identify local clusters of intrinsically correlated neighboring variables (Figure 1c) and summarizing SRV clusters in different measures (see Methods) (Figure 1b). Results from the various summary statistics in this new approach were compared under simulated data (Figure 1d) using the running window method13 as reference. Resulting LOD score thresholds at 5% FDR were reduced and more robust across all measures when spectral alignment methods were applied (Figure 1d), indicating that high LOD scores accounting for linkages to background noise were removed. We were able to reduce the original number from 13 457 measures to a few hundred with the various summary measures. Different summary statistics provide a different correlation between consecutive peaks (see Methods), and therefore different number of overall peaks were obtained for sum (343), mean (343), median (344), max (372), trapezium

(479), and rectangle (480). There was good agreement between the various methods both in terms of peak identification as well as border detection, with >92% consensus and up to 97% if limited to mean, median, max, and sum (Figure 1c). As expected from the similarity of their baseline-corrected definitions, the rectangle and trapezium statistics identified the most consistent peaks (91% match), which were slightly more conservative than the other measures. QTL Mapping Analysis of Untargeted Metabolic Peak Summaries

To evaluate the performance and reliability of mQTL mapping with these intermediate spectral quantitative variables, traits calculated with each of the summary statistics were used to test for association with genetic loci by QTL analysis. Owing to variations in the definition of peaks with the various summary statistics, results from mQTL analysis (Table 1, Figure 2a) were collated at spectroscopic (Figure 2b) and genetic location (Figure 2c) levels. This also included a statistical significance at both genome-wide and metabolome-genome-wide levels (Figure 2d) to give a more appropriate 637

dx.doi.org/10.1021/pr200566t |J. Proteome Res. 2012, 11, 631–642

Journal of Proteome Research

ARTICLE

Figure 3. Genome wide linkage mapping of selected rectangle summary statistics derived from urine 1H NMR metabolomic profiling in F2 (129S6xBALB/c) hybrid mice. QTL mapping profiles are shown for various 1H NMR resonances (ppm) showing the strongest evidence of statistically significant linkage (LOD > 9) as detailed in Table 1.

assessment of the efficiency of the method (Supplementary Table 1, Supporting Information). Top LOD scores were very consistent across the various summary statistics in terms of genetic location and NMR peak (Table 1). All measures identified the strongest evidence of genetic linkage for resonances between 4.086 ppm and 4.092 ppm and the locus C9loc60 on mouse chromosome 9 (Figure 2c, Supplementary Figures 15, Supporting Information). However, we noted important differences in statistical significance between these measures, ranging from LOD 22.6 (trapezium) to LOD 42 (max), indicating that NMR peak calculation methods, including baseline removal, have a strong impact on QTL detection. Other hits were identified with lower, but more consistent, significant LOD scores. For example, evidence of

genetic linkages was detected between chromosome 2 loci and metabolites characterized by peaks at 0.940 ppm and 2.160 ppm, and between chromosome 13 and peaks at 1.542 ppm, 1.569 ppm, 2.215 ppm, 2.428 ppm, and 2.452 ppm (Table 1, Figure 3). Results from QTL analysis of each spectral peak trait quantification method were compared to statistics obtained with the original running-window method11 and its 13 457 original variables. The same top LOD scores were generally identified with this approach, but their statistical significance was greatly reduced due to the hundred times larger number of tests that have to be performed for each shift of the running window. Furthermore, with improved peak characterization, more biologically meaningful measures were tested and the LOD scores were much larger. For example, genetic linkage between C9loc60 and 4.092 638

dx.doi.org/10.1021/pr200566t |J. Proteome Res. 2012, 11, 631–642

Journal of Proteome Research

ARTICLE

ppm was 25.2 when the running-window approach was used after spectral alignment by RSPA, and it increased to 42 for the SRV with Max as a summary statistic.

uniquely high LOD score observed in the experiment at 4.09 ppm could reflect a phenomenon not modeled by either our simple simulation associating directly the mean intensity to genotype or our choice of summary statistics for peak characterization in the mQTL analysis.

Structural Assignment of Untargeted mQTLs

To identify metabolites underlying the most statistically significant mQTLs linked to NMR spectral data, we used metabolomic data repositories which provide details of peak positions for a wide range of compounds. Using the NMR profile at the resonance cluster giving the best LOD score at C9loc60, we were able to show that the peak of mean resonance 4.092 ppm corresponds to glycerate (Figure 1c). Interestingly, glycerate is characterized by a 1H NMR spectrum with three distinctive multiplet bands at 4.09 ppm, 3.703.74 ppm, and 3.803.84 ppm. The latter resonances also showed evidence of significant linkages to the locus C9loc60. However, even though these linkages remained in the top percentile, LOD scores were much lower than that observed for the peak at 4.09 ppm (Table 1), suggesting possible interferences with overlapping resonances, possibly from carbohydrate molecules (e.g., glucose has multiple peaks in this region). To assess the underlying genetic effect of the locus on glycerate regulation, putative glycerate signal intensities were determined for the three genotypes at the locus C9loc60 from the NMR profile at 4.09 ppm. Mean urine glycerate values were 11.22 ( 3.60 in 129S6 homozygous mice, 4.63 ( 1.58 in heterozygous mice, and 4.65 ( 1.63 in BALB/c homozygous mice, indicating that 129S6 alleles at the locus are associated with increased urine glycerate accumulation through a recessive mode.

Metabolic Profile Simulations under the Null

As the evaluation of the significance of a classic QTL is well documented, we used the built-in function from the R/QTL package to calculate the genome-wide significance of our hits (Supplementary Table 1, Supporting Information).23 However, we performed QTL tests across the metabolome for as many traits as identified by the SRV, which therefore required a means to determine metabolome-genome-wide significance of our results. In order to evaluate more generally the sensitivity and specificity of our approach of spectral peak quantification with various summary statistics, we simulated NMR profiles for 500 null cohorts with a single mix of characteristics, that is, without association to any genotype. After alignment of the simulated NMR profiles, it was possible to identify automatically the peaks and reduce the original number of 13 457 measures by a factor 30 using SRV with the summary statistics, including trapezium (506 ( 8.9), rectangle (505 ( 9.0), sum (497 ( 8.9), mean (497 ( 9.0), median (507 ( 8.7), and max (492 ( 8.8). In each cohort, the generated traits were then tested independently for association with a genetic locus by QTL analysis. The distribution of the best LOD scores genome- and metabolomewide across the 500 cohorts provides estimates of the significance level of our experimental results for each summary statistic (Figure 1d). For example, the overall best LOD score of 42.28 is obtained at resonance 4.09 ppm with the max as summary statistic, while the best score obtained from the 500 simulations under the null is 10, leading to high significance of the best LOD score of 42.28 (p < 0.002). Similarly, the significance of the LOD score at the two other resonances associated with glycerate at 3.73 ppm and 3.83 ppm are 0.002 for the LOD of 8.44 and 0.008 for the LOD of 7.30, respectively. Inversely, we were able to define from the same distributions the top 95% LOD scores expected from a run to stand at 6.72 (sum), 6.55 (max), 6.21 (median), 6.62 (mean), 12.92 (trapezium), and 8.81 (rectangle). Even though the sum, median, and mean summary statistics provide far fewer LOD scores above 6 (Table 1), their significance can be higher. There were 5 significant scores at 95% for the sum, 8 for the max, 6 for the median, 5 for the mean, 5 for the trapezium, and 14 for the rectangle. These data provide statistical support to the choice of the most appropriate summary statistics, as rectangle provides extra significant scores, leading to a greater sensitivity. Moreover, this measure is the most biologically relevant, characterizing a metabolite’s intensity by the area under the NMR curve, after correcting for the nonzero baseline caused by overlap with nearby peaks from other metabolites. Finally, these simulations of NMR profile based on the real genotypes could highlight a systematic bias in the mQTL due to problem with the markers. None of the three key genetic loci (C2loc50, C13loc5, C9loc60) were over-represented in the simulations LOD score under the null (data not shown).

From Untargeted mQTL to Mechanistically Relevant Associations

In an attempt to validate the mQTL linked to glycerate, we searched for candidate genes involved in glycerate processing at the locus c9loc60. We identified on chromosome 9qF1 the gene encoding glycerate kinase (GLYCTK) located directly under mQTL peak associated with glycerate. We resequenced the gene in both 129S6 and BALB/c progenitor strains of the F2 cohort, and identified a frameshift mutation causing a premature stop codon in the 129S6 strain, whereas the gene in BALB6/c remains unchanged when compared to the C57BL6/J reference sequence. This mutation unequivocally links this gene to deficient glycerate kinase activity and excess of urine glycerate in 129S6 mice. Simulations of 1H NMR Data Sets Using Experimental Dispersion Parameters

To assess QTL robustness, we carried out modeling of the genetic mechanism using simulations matching our experiment as closely as possible for comparison. We used MetAssimulo22 to generate simulated NMR profiles with characteristic glycerate intensities of the 129S6 homozygous (11.2), heterozygous (4.6), and BALB/c homozygous (4.6) genotypes at C9loc60. Processing the simulated NMR profiles through our method, we were able to confirm that the recoupled peaks at 4.09 ppm were the most significantly associated across the various summary statistics. Furthermore, the other two peaks associated with glycerate (3.71 and 3.81 ppm) showed similar LOD scores as expected from MetAssimulo, which works at the metabolite rather than resonance level. However, the top LOD score was only 13 across the three spectral locations. This is very much in line with the experimental observations at 3.71 and 3.81 ppm, but not 4.09 ppm, which was far more significant across all summary statistics. Therefore, the

Mapping Genetic Determinants of Metabolic Pathways

On several occasions, we identified colocalized mQTLs for apparently distinct metabolite signatures, suggesting a global impact of genetic loci on pathways rather than pleiotropic effects (Table 1, Figure 3). For example, marker loci in a region of mouse chromosome 2 (rs8281185-C2loc50; 50-62cM) showed 639

dx.doi.org/10.1021/pr200566t |J. Proteome Res. 2012, 11, 631–642

Journal of Proteome Research

ARTICLE

evidence of linkages to resonances at 0.941 ppm, potentially corresponding to alpha-ketoisocaproate or isoleucine and 2.167 ppm corresponding to butyrate, folate, homoserine, methionine, formiminoglutamate, gamma-carboxyglutamate, or sebacate. Along the same line, the region of marker locus C9loc60 was linked to glycerate and other metabolites at 3.741 ppm (LOD = 12.4) and 3.841 ppm (LOD = 18.7) potentially corresponding to glucose, and 4.240 ppm corresponding to phosphoserine (LOD = 10.1), suggesting that glycerate kinase deficiency in 129S6 mice has broad metabolic repercussions on the cellular regulation and urinary excretion of a number of metabolites.

may lead to the characterization of three traits instead of the actual two independent ones. How much to increase the number of traits tested by more stringent criteria for collapsing of adjoining intensities is left as a choice, as a lesser amount of correlation between peaks might correspond to multiple metabolites linked within a single pathway. Furthermore, reduced dimensionality has the non-negligible effect of speeding up analysis after initial peak identification. In practice, reduction in the number of mQTL analyses to be performed by a hundredfold results in a proportional cut in processing time, thus facilitating CPU intensive permutation tests required to evaluate statistical significance. Moreover, unlike the original running window approach, these traits can now be considered independently and computed in parallel, providing further reduction in processing time. The most significant mQTL detected in the cross, which was linked to increased urine glycerate concentration, is most likely caused by a frameshift mutation in 129S6 mice at the locus that prevents glycerate kinase synthesis. This result is in line with previous metabolomic analyses in 129S6 and BALB/c mice, which showed that glycerate is among the most significant strain specific metabolites in these models and that plasma glycerate concentration is significantly more elevated in 129S6 than in BALB/c mice.15,24 Deficiency of glycerate kinase prevents the conversion of glycerate to 3-phosphoglycerate, leading to increased excretion of glycerate which causes hyperoxaluria-II and glyceric acidemia and aciduria in humans.25,26 Phenotypes relevant to the human conditions have not been reported in 129S6 mice. Glycerate kinase also interacts with a thyroid hormone receptor interactor (TRIP13) and thyroid hormone receptor beta (THRB),27 which may partly explain differential adaptations to high fat diet feeding in 129S6 and BALB/c.15 Broader perspectives of genetic mapping metabolomic data, which can be best achieved in segregating populations,28 lie in the possibility to test causal relationships between a genetic locus and a biological pathway through mapping coordinated changes in the abundance of functionally linked metabolites. Glycerate is an intermediate in the degradation of serine and the metabolism of fructose and its accumulation may have important repercussions on the regulation of these pathways. We identified colocalized mQTLs in the region of mouse chromosome 9 close to marker locus C9loc60 with metabolites known to be involved in similar pathways (e.g., glycerate, serine, indole-3-pyruvate) as well as metabolites that could not be unambiguously identified. Even though these mQTLs may be controlled by distinct causative genetic variants or underlie pleiotropic effects of a genetic locus near C9loc60, they may equally account for metabolic adaptations of glycerate kinase deficiency. Integrative analysis of metabolomic data and other functional genomic data sets interrogating different levels of gene expression regulation may shed light on coordinately regulated functional networks12,28 that may have biological signatures at the gene transcription level in 129S6 and BALB/c mice.29 In conclusion, we have developed novel automated analytical tools and methods to map the genetic determinants of the metabolome, which broaden the analysis of the genetic control of the abundance of molecular compounds across the full NMR spectrum and lead to an up to 34-fold increase in the number of molecules that are considered in genetic analyses when a targeted approach is applied. Further progress in untargeted metabolome QTL analysis largely depends on improvements in NMR spectral annotations that can assign metabolite names to spectral peaks.

’ DISCUSSION We have demonstrated the genetic mapping of quantitative variables of the metabolome using novel analytical tools specifically designed to systematically and accurately dissect and quantify 1H NMR spectral data from biological samples of genetically heterogeneous individuals. We propose a novel and balanced strategy for metabolomic QTL mapping, which takes full advantage of large data sets that increasingly effective and powerful functional genomic platforms generate, in order to extract a maximum of biological information for genetic analysis, while applying essential dimension reduction by removing strongly correlated signals that mark the same molecular compounds to improve statistical performance. These tools, optimized with urine metabolomic data from mice of an F2 (129S6xBALB/c) cross, allowed maximization of the accuracy, robustness, and significance of metabolomic QTLs, and pave the way to realistic applications of untargeted metabolomic approach in GWAS of disease intermediate and potentially predictive metabotypes in humans. Functional genomic technologies are powerful high density molecular phenotyping generation systems to interpret the biology of GWAS results2 and to assist the discovery of disease genes and biomarkers. The use of an untargeted approach for QTL mapping of metabolomic profiles, which produce a large number of molecular quantitative phenotypes,5,7 poses the problem of multiple testing that affects the power to detect statistically significant signals. The combined use of RSPA17 and SRV18 after NMR spectral alignment successfully reduced the total number of tests performed from 13 457 to