mQTL.NMR: An Integrated Suite for Genetic Mapping of Quantitative Variations of 1H NMR-Based Metabolic Profiles


Article

mQTL.NMR: An integrated suite for genetic mapping of quantitative variations of 1H NMR-based metabolic profiles Lyamine Hedjazi, Dominique Gauguier, Pierre Zalloua, Jeremy Kirk Nicholson, Marc-Emmanuel Dumas, and Jean-Baptiste Cazier Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.5b00145 • Publication Date (Web): 24 Mar 2015 Downloaded from http://pubs.acs.org on March 28, 2015


Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Analytical Chemistry

mQTL.NMR: An integrated suite for genetic mapping of quantitative variations of 1H NMR-based metabolic profiles

Lyamine Hedjazi,† Dominique Gauguier,†,§ Pierre Zalloua,$ Jeremy Nicholson,‡ Marc-Emmanuel Dumas*,‡ and Jean-Baptiste Cazier*,║,#



† Institute of Cardiometabolism and Nutrition (ICAN), Department of Omics Sciences, University Pierre & Marie Curie, 91 boulevard de l'Hôpital, 75013 Paris, France.

$ Lebanese American University, School of Medicine, Beirut, Lebanon.

‡ Imperial College London, Section of Biomolecular Medicine, Division of Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Sir Alexander Fleming building, London SW7 2AZ, UK.

§ Sorbonne Universities, University Pierre & Marie Curie, University Paris Descartes, Sorbonne Paris Cité, INSERM UMR_S 1138, Cordeliers Research Centre, 75006 Paris, France.

║ Department of Oncology, University of Oxford, Roosevelt Drive, Oxford OX3 7DQ, UK.

# Centre for Computational Biology, University of Birmingham, Haworth Building, Edgbaston B15 2TT, UK.

KEYWORDS. Metabonomics, Genetics, Metabotypes, QTL mapping, Chemo-informatics, Functional genomics

ACS Paragon Plus Environment


ABSTRACT High-throughput 1H Nuclear Magnetic Resonance (NMR) spectroscopy is an increasingly popular and robust approach for qualitative and quantitative metabolic profiling, which can be used in conjunction with genomic techniques to discover novel genetic associations through metabotype Quantitative Trait Locus (mQTL) mapping. There is therefore a crucial need to develop specialized tools for accurate detection and unbiased interpretation of genetically determined metabolic signals. Here we introduce and implement a combined chemo-informatic approach for objective and systematic analysis of untargeted 1H NMR-based metabolic profiles in quantitative genetic contexts. The R/Bioconductor mQTL.NMR package was designed to (i) perform a series of preprocessing steps restoring spectral dependency in collinear NMR datasets to reduce the multiple testing burden, (ii) carry out robust and accurate mQTL mapping in human cohorts as well as in rodent models, (iii) statistically enhance structural assignment of genetically determined metabolites and (iv) illustrate results with a series of visualization tools. Built-in flexibility and implementation in the powerful R/Bioconductor framework allow key preprocessing steps such as peak alignment, normalization or dimensionality reduction to be tailored to specific problems. The mQTL.NMR package is freely available with its source code through the Comprehensive R/Bioconductor repository and its own website (http://www.icaninstitute.org/tools/). It represents a significant advance in facilitating untargeted metabolomic data processing, quantitative analysis and genetic mapping.


INTRODUCTION

Deciphering the genetic determinants of complex disorders is rapidly evolving from pure genetic association studies based on disease status to the analysis of the genetic control of molecular phenotypes used as disease biomarkers.¹ Metabolomics is a powerful systems biology approach that analyzes multivariate data representing a range of small molecules in a biological sample.²,³ Genetic studies of the metabolome have until now mainly focused on targeted approaches reporting known metabolites present in a biological sample.⁴,⁵ Nuclear Magnetic Resonance (NMR) spectroscopy readily operates in the low µM range, where a very large number (hundreds to thousands) of important metabolic intermediates can be detected.⁶ NMR measures basic constituents of metabolism and hence is an excellent unbiased detector for metabolic phenotypes (i.e., metabotypes).⁷ Despite its attractive computational properties, selective metabolite quantification in NMR profiles remains prone to investigator-led bias.⁸ Targeted profiling misses a large proportion of molecular compounds, which are not selectively quantified despite their presence in the sample and their potential relevance to the investigated condition. However, when performed in a standardized way, high-resolution 1H NMR is a particularly reliable and robust technique for metabolomic applications thanks to its exceptional reproducibility.⁹⁻¹¹ It can be used efficiently for repeated and longitudinal molecular phenotyping to assess genetic association between specific metabolites and genomic regions.¹²⁻¹⁵

Unfortunately, these revolutionary advances in NMR technology have not been accompanied by sufficient software development for data preprocessing and analysis. In the context of genetic studies, the high-density spectral data generated by this technology generally exhibit high dimensionality and some degree of variability, which make systematic acquisition and quantification of metabolite signals challenging. Such issues become critical for profile comparison in large cohorts of genetically and phenotypically heterogeneous individuals.¹⁴

There is therefore a crucial need to develop a new suite of integrated tools for robust, systematic and accurate analysis of untargeted quantitative metabotypes generated by 1H NMR and for their genetic mapping. Although several software packages have been developed in the last decade for NMR metabolomics applications,¹⁶⁻²² there is still a strong need for an integrated and well-validated pipeline for mapping their genetic determinants. We present here the R/Bioconductor package mQTL.NMR, specifically designed to integrate objective genetic mapping of untargeted 1H NMR-based metabotypes. The package is a ready-to-use analytical tool that provides functionalities ranging from preprocessing of NMR data (including normalization, scaling, alignment and dimensionality reduction) to visualization of mQTL mapping results.

Normalization and scaling are key preprocessing steps for NMR data, used to identify and remove sources of variability related to sample dilution or variation in instrument detector sensitivity.²³⁻²⁶ The mQTL.NMR package implements the normalization and scaling techniques most widely used by the metabolomics community. Another important preprocessing step for NMR metabolomics data is spectral alignment, which corrects the variation in chemical shift position (i.e., positional noise²⁷) of specific metabolites caused by changes in pH, salt concentration and temperature.²⁸ The mQTL.NMR package accounts for these local positional variations by using the Recursive Segment-wise Peak Alignment (RSPA) approach.²⁹

As with other "omics" platforms, the high resolution of NMR profiling comes at a cost: the high dimensionality of the data and the associated multiple testing corrections.¹⁴ Two dimensionality reduction approaches are implemented within the mQTL.NMR package: binning³⁰ and Statistical Recoupling of Variables (SRV).³¹ Furthermore, the results of dimensionality reduction can be visualized using dedicated functions for checking purposes.

Finally, the mQTL.NMR package provides complete support for performing mQTL analysis in either human populations or animal models, based on several regression approaches. Permutation tests are also available to estimate a significance threshold. Several visualization functions are available to help with structural assignment and metabolite identification, as well as a tailored circular plotting tool associating chemical shifts with genomic positions. In the following sections, we describe in more detail the functionalities implemented within the mQTL.NMR package, with an application case using an existing NMR metabolic dataset.


EXPERIMENTAL SECTION

Implementation and Workflow

The mQTL.NMR package consists of an R-based suite of functions designed to (i) preprocess untargeted 1H NMR spectral metabotypes, (ii) perform statistical testing of association between genetic loci and quantitative metabotypes and (iii) visualize the outputs with a particular focus on structural assignment (Figure 1). It requires as input the raw metabolomic data, processed in a standard way on the NMR spectrometer workstation (Fourier transform, apodization, calibration, phasing and baseline correction) before import as a text file. The user should also separately provide additional information about the genetics and structure of the population (e.g., sex, age or relatedness between individuals). For experimental crosses, it is expected that genetic maps have been calculated with actual genotype data to verify genotype accuracy. We give below a brief description of the main tasks that can be performed by the mQTL.NMR package with metabolomic spectral and genotype data, including the analysis and visual output that can be generated.

Normalization and Scaling of 1H NMR Spectral Metabotypes

Several approaches have been proposed to normalize NMR spectral data.²⁵

These methods can be grouped into two main categories: methods that remove unwanted sample-to-sample variation, and methods that adjust the variance of different metabolites by variable scaling²⁶ (Figure S1). Recent comparative studies have suggested that normalization is context-dependent.²⁵ For this reason, we implemented the most widely used normalization and scaling methods within our package. Normalization methods include the constant sum,³² constant noise, linear baseline³³ and probabilistic quotient³⁴ normalization approaches. The mQTL.NMR package implements two scaling approaches: auto-scaling³⁵ and Pareto scaling³⁶ (Table S1). For the Probabilistic Quotient Normalization (PQN) approach, the median spectrum is the default reference spectrum, but the user can select a different reference spectrum. The user is free to apply these methods successively at the normalization stage, with no restriction on their order.

Peak Alignment of 1H NMR Spectral Data

The mQTL.NMR package uses the Recursive Segment-wise Peak Alignment (RSPA) approach²⁹ to align the positions of NMR spectral peaks with respect to a reference spectrum chosen either automatically or manually. RSPA was recently introduced and shown to compare favorably with other peak alignment approaches over a wide range of peak intensities.²⁹
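The normalization and scaling options described in the preceding section can be illustrated with a short Python sketch. It is not the mQTL.NMR implementation: the function names are hypothetical, spectra are assumed to be plain lists of intensities (one list per sample), and the PQN variant shown skips any preliminary total-area normalization. It assumes no variable is constant across samples, since a zero standard deviation would make the scaling undefined.

```python
from statistics import mean, median, stdev


def constant_sum(spectrum, total=1.0):
    """Scale one spectrum so that its intensities add up to `total`."""
    s = sum(spectrum)
    return [total * v / s for v in spectrum]


def pqn(spectra, reference=None):
    """Probabilistic quotient normalization of a list of spectra (rows)."""
    # Default reference: the point-wise median spectrum, as in the text.
    if reference is None:
        reference = [median(col) for col in zip(*spectra)]
    normalized = []
    for spectrum in spectra:
        # Quotients against the reference, skipping zero reference points.
        quotients = [v / r for v, r in zip(spectrum, reference) if r != 0]
        dilution = median(quotients)  # most probable dilution factor
        normalized.append([v / dilution for v in spectrum])
    return normalized


def scale_columns(spectra, method="auto"):
    """Mean-center each variable, then divide by sd (auto) or sqrt(sd) (pareto)."""
    columns = []
    for col in zip(*spectra):
        m, sd = mean(col), stdev(col)
        divisor = sd if method == "auto" else sd ** 0.5
        columns.append([(v - m) / divisor for v in col])
    # Transpose back to one row per sample.
    return [list(row) for row in zip(*columns)]
```

Because each function takes and returns plain lists, the steps can be chained in any order, for example `scale_columns(pqn(spectra), method="pareto")`, which mirrors the flexibility described above.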

RSPA aims at locally realigning the peak positions of a 1H NMR spectrum (test spectrum) with respect to the peak positions of a reference spectrum. The reference spectrum for alignment can be specified in two ways within the mQTL.NMR package: automatically (the default, based on a fuzzy correlation among whole spectra) or manually. In RSPA, reference and test spectra are first segmented using a Savitzky-Golay filter. The peak shifts in each test segment s(i) are then recursively corrected to match the peak positions of the corresponding reference segment r(i) by subdividing the initial large test segment into smaller ones. RSPA is mainly based on maximization of the following cross-correlation function:

$$c_{r,s}(d) = \sum_{n=1}^{N} r(n)\, s(n+d) \tag{1}$$

$$d^{*} = \arg\max_{d}\; c_{r,s}(d) \tag{2}$$
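A minimal Python sketch of the shift-estimation step expressed by Eqs 1 and 2 is given here, taking d as the candidate shift (in data points) between a reference segment r and a test segment s of length N. It uses direct summation rather than the FFT, treats points falling outside the test segment as zero, and leaves out the recursive segmentation that RSPA performs. The function names and the `max_shift` search window are illustrative assumptions, not part of the mQTL.NMR API.

```python
def cross_correlation(r, s, d):
    """c_{r,s}(d) = sum_n r(n) * s(n + d); points outside s count as zero."""
    total = 0.0
    for n in range(len(r)):
        m = n + d
        if 0 <= m < len(s):
            total += r[n] * s[m]
    return total


def best_shift(r, s, max_shift):
    """Return the shift d in [-max_shift, max_shift] that maximizes c_{r,s}(d)."""
    candidates = range(-max_shift, max_shift + 1)
    return max(candidates, key=lambda d: cross_correlation(r, s, d))


def apply_shift(s, d):
    """Move the content of s by d points toward lower indices (zero padded)."""
    n_points = len(s)
    return [s[n + d] if 0 <= n + d < n_points else 0.0 for n in range(n_points)]
```

For a test segment whose peak sits two points to the right of the reference peak, `best_shift` returns 2 and `apply_shift(s, 2)` moves the peak back onto the reference position. Restricting the search to a window keeps the direct summation cheap; for long segments an FFT-based computation would be the usual choice.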




In Eqs 1 and 2, d is the shift between the reference (r) and test (s) segments and N is the segment length. To reduce computational complexity, the calculation of the cross-correlation function is accelerated using the Fast Fourier Transform (FFT), as follows:

$$c_{r,s}(d) \leftrightarrow R(j)\, S^{*}(j) \tag{3}$$

where R and S are the Fourier transforms of r and s, respectively, and * denotes complex conjugation.

Dimensionality Reduction of NMR Spectroscopic Data

Whilst the majority of NMR-based studies have used high-resolution processing for over a decade,²⁷ the collinear nature of these high-resolution NMR datasets makes the multiple testing problem difficult to handle in genetic mapping applications. The mQTL.NMR package offers two approaches to reduce the number of traits considered for genetic mapping analysis: bucketing³⁰ and Statistical Recoupling of Variables (SRV).³¹ We recommend the SRV approach, which derives local sets of consecutive variables with a similar variation pattern using a local spectral dependency measure L(i), based on the ratio between the covariance and the correlation of two consecutive variables (e.g., i and i+1), as follows:

$$L(i) = \frac{\operatorname{covariance}}{\operatorname{correlation}}\,(i,\, i+1) \tag{4a}$$

$$L(i) = \frac{1}{n-1}\sqrt{\sum_{k=1}^{n}\left(x_{k,i}-\bar{x}_{i}\right)^{2}\;\sum_{k=1}^{n}\left(x_{k,i+1}-\bar{x}_{i+1}\right)^{2}} \tag{4b}$$

where the sums run over the n samples, x_{k,i} is the intensity of variable i in sample k and x̄_i is the mean of variable i.
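The dependency measure of Eq 4 can be checked numerically with a short Python sketch. The data layout (a list of samples, each a list of intensities) and the function names are assumptions made for illustration, and the closed form assumes the usual sample estimators with an n − 1 denominator. The sketch covers only the measure L(i), not the clustering and supercluster aggregation steps of the SRV algorithm.

```python
from math import sqrt


def _centered_column(data, i):
    """Return variable i across all samples, with its mean subtracted."""
    values = [row[i] for row in data]
    m = sum(values) / len(values)
    return [v - m for v in values]


def landscape_ratio(data, i):
    """Eq 4a: sample covariance over Pearson correlation for variables i, i+1."""
    x, y = _centered_column(data, i), _centered_column(data, i + 1)
    cross = sum(a * b for a, b in zip(x, y))
    cov = cross / (len(x) - 1)
    corr = cross / sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    # Undefined (division by zero) if a variable is constant or the pair is uncorrelated.
    return cov / corr


def landscape_closed_form(data, i):
    """Eq 4b: the same quantity written with sums of squared deviations."""
    x, y = _centered_column(data, i), _centered_column(data, i + 1)
    return sqrt(sum(a * a for a in x) * sum(b * b for b in y)) / (len(x) - 1)
```

Both functions return the product of the standard deviations of the two neighboring variables, so L(i) is large where two consecutive points both vary strongly across samples, which is the behavior expected inside a peak.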

The SRV algorithm is then used to test whether consecutive clusters are correlated, in order to identify potential neighboring peaks and aggregate them into a single supercluster. A variety of statistical measures can then be used to summarize each cluster (peak) identified by the SRV algorithm. Basic measures such as the median, maximum and sum (area under the curve) can be used to summarize a peak. However, more accurate measures based on baseline correction, such as rectangle and trapezium subtraction (Figure S2), have recently been proposed by Cazier et al.¹⁴ Subtracting a rectangle below the minimum value, or a trapezium below the extreme values, can remove potential baseline variation caused by other NMR signals.

SRV is a significant improvement over bucketing³⁰ (or binning), which was introduced for dimensionality reduction in the early years. In bucketing, data reduction is performed by simply grouping consecutive spectral variables: the spectra are divided into evenly spaced windows (i.e., bins or buckets) whose width commonly ranges between 0.001 and 0.05 ppm. The intensities inside each bucket are summed, so that the area under each spectral region (i.e., the integral) is used to summarize the whole bucket. Eq 5 summarizes the bucketing procedure applied to a data matrix X with samples in rows, variables in columns and elements x_ij, where i = 1, 2, …, I and j = 1, 2, …, J. For each sample i, x_ij is the intensity of the raw signal at point j. The parameter N is the number of data points in each bucket and can be calculated as the ratio between the bucket width and the spectral sampling interval. The parameter K is the final number of buckets:
12 =

7∗9

3

6:7∗9;