Data Normalization of 1H NMR Metabolite ... - ACS Publications

Jul 6, 2015 - Data Normalization of 1H NMR Metabolite Fingerprinting Data Sets in the .... Environmental metabolomics with data science for investigat...
0 downloads 0 Views 3MB Size
Subscriber access provided by NEW YORK UNIV

Article 1

Data Normalization of H-NMR Metabolite Fingerprinting Datasets in the Presence of Unbalanced Metabolite Regulation Jochen Hochrein, Helena U. Zacharias, Franziska Taruttis, Claudia Samol, Julia C. Engelmann, Rainer Spang, Peter J. Oefner, and Wolfram Gronwald J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00192 • Publication Date (Web): 06 Jul 2015 Downloaded from http://pubs.acs.org on July 11, 2015

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Data Normalization of

1

H-NMR Metabolite Fingerprinting Datasets in the

Presence of Unbalanced Metabolite Regulation Jochen Hochrein, Helena U. Zacharias, Franziska Taruttis, Claudia Samol, Julia C. Engelmann, Rainer Spang, Peter J. Oefner and Wolfram Gronwald* Institute of Functional Genomics, University of Regensburg, Josef-Engert-Str. 9, 93053 Regensburg, Germany

*To whom correspondence should be addressed Wolfram Gronwald, Institute of Functional Genomics, University of Regensburg, Josef-Engert-Str. 9, 93053 Regensburg, Germany Email: [email protected] Phone: ++49-941-943-5015 Fax: ++49-941-943-5020

ACS Paragon Plus Environment

1

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 35

Abstract Data normalization is an essential step in NMR-based metabolomics. Conducted properly, it improves data quality and removes unwanted biases. The choice of the appropriate normalization method is critical and depends on the inherent properties of the dataset in question. In particular, the presence of unbalanced metabolic regulation, where the different specimens and cohorts under investigation do not contain approximately equal shares of up- and down-regulated features, may strongly influence data normalization. Here, we demonstrate the suitability of the ShapiroWilk-Test to detect such unbalanced regulation. Next, employing a Latin-square design consisting of eight metabolites spiked into a urine specimen at eight different known concentrations, we show that commonly used normalization and scaling methods fail to retrieve true metabolite concentrations in the presence of increasing amounts of glucose added to simulate unbalanced regulation. However, by learning the normalization parameters on a subset of non-regulated features only, Linear Baseline

Normalization,

Probabilistic

Quotient

Normalization

and

Variance

Stabilization Normalization were found to account well for different dilutions of the samples without distorting the true spike-in levels even in the presence of marked unbalanced metabolic regulation. Finally, the methods described were applied successfully to a real world example of unbalanced regulation, namely a set of plasma specimens collected from patients with and without acute kidney injury after cardiac surgery with cardiopulmonary bypass use. Keywords:

NMR,

data

normalization,

metabolomics,

unbalanced

regulation,

confounding factors

Introduction Metabolomics aims at the comprehensive analysis of small organic compounds in cells, tissues, organs, organisms, and body fluids for elucidating molecular mechanisms and identifying diagnostic and prognostic biomarkers.1 Hyphenated mass spectrometry and nuclear magnetic resonance (NMR) spectroscopy are the main tools employed in metabolomics. Here, we focus on the application of solution NMR spectroscopy. Due to the chemical complexity of biological specimens2 and the typical direct application of NMR to untreated body fluids, NMR spectra contain a correspondingly large number of spectral features that require the use of multivariate ACS Paragon Plus Environment

2

Page 3 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

statistical techniques1 to unveil the variance of individual metabolite concentrations and their joint covariance structure. For this, data should ideally reflect only metabolic differences that are induced by disease and/or in response to treatment, with other sources of biological and technical variation kept to a minimum. A common source of non-induced variance stems from the large dynamic range of metabolite concentrations in biomedical specimens,3 which gives rise to unequal variance of residuals, as the variability of the concentration of a given metabolite often correlates with its mean concentration, thus leading to considerable heteroscedasticity in the data. Differences in fluid intake prior to the collection of urine specimens are another common source of unwanted sample-to-sample variation in global metabolite concentrations. Further sources are experimental error and technical variation. The aim of data normalization techniques is the reliable reduction of these types of unwanted variances while preserving the differences of interest. There is a multitude of normalization methods available, the simplest being linear normalization methods4 that assign an appropriate weight to each sample to achieve comparability across samples. However, for these and other methods that require the initial spectral mapping to a reference spectrum, it is generally assumed that only a relatively small proportion of the metabolites is regulated in approximately equal shares up and down, also termed the self averaging property of the data.5 This implies that average total spectral areas are similar across specimens and groups. Whenever this assumption is violated, erroneous results may be obtained such as the reduction of differences between truly differentially expressed features and the introduction of erroneous inter-group differences for non-regulated features. Both will severely confound the results of the subsequent statistical data analysis. Study of actual biomedical datasets often reveals the violation of the above assumption. A typical example is kidney failure,6 where impaired renal elimination of endogenous and exogenous compounds leads to their accumulation in blood without a corresponding equal decrease in the abundance of other blood constituents. The goals of this paper are twofold. First, we aim at identifying a method for the reliable detection of unbalanced regulation to prevent improper normalization of metabolic data. Second, in the presence of unbalanced regulation, we aim at identifying normalization methods that are relatively insensitive to such distortion. For this, we first test a set of different normalization methods where the parameters of the normalization approach in question are determined on the complete metabolic

ACS Paragon Plus Environment

3

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

fingerprinting dataset. The methods investigated include Probabilistic Quotient Normalization,7 Cyclic Loess Normalization,8,9 Contrast Normalization,10 Quantile Normalization,4 Linear Baseline Normalization,4 Li-Wong Normalization,11 Cubic Splines Normalization,12 Auto Scaling,13 Pareto Scaling,14 and Variance Stabilization Normalization.15 Of these, the first eight methods perform object-wise transformations with the aim of reducing between-sample variation, while the latter three methods perform feature-wise transformations that aim at adjusting the variance of the different features. An exception is Variance Stabilization Normalization, which performs both tasks. Methods that aim at the normalization of low dimensional metabolic data16 were not investigated. Next, as unbalanced regulation is caused by a smaller or larger subset of all features, we devised an approach to constrain learning of the parameters of the respective normalization approach to a subset of the data that ideally contains only nonregulated features. The use of non-regulated features for data normalization has been previously proposed for microarray17,11 and metabolomic data18,19 Similar to the approach proposed here, Pelz et al.17 identified a set of rank invariant genes. Alternatively, De Livera et al.18,19 proposed the use of so-called quality control features, i.e., internal standards, features correlating with an internal standard, or features known a priori not to be regulated, in a linear mixed model. For evaluating the performance of the different normalization methods in the presence of increasing unbalanced regulation we employ a quasi Latin-square design,20 where eight endogenous metabolites are added at varying concentrations to a specimen of human urine while keeping the total amount of added metabolites constant in each sample. To simulate increasing amounts of unbalanced regulation, glucose is added in different amounts. Glucose has been chosen because it is a metabolite commonly found in large quantities in biological specimens such as plasma and in the urine of diabetic patients. Moreover, it yields numerous signals in proton NMR spectra that may exert a strong effect on data normalization. In addition, as normalization strives for the removal of unwanted sample-to-sample variation caused, for example, in the case of urine by differences in fluid intake, we simulate this variation by varying the the numbers of scans for NMR data acquisition. The performance of the different data normalization methods is compared on the basis of the estimation of the actual fold changes in metabolite abundance. Finally, the methods described are applied to a set of plasma specimens obtained from patients with or without acute kidney injury

ACS Paragon Plus Environment

4

Page 4 of 35

Page 5 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

after cardiac surgery with cardiopulmonary bypass use.

Experimental section Datasets The first dataset consisted of a quasi Latin-square spike-in design. As background for the spike-in, a urine specimen was collected from a healthy volunteer at the University of Regensburg. Next, eight endogenous metabolites, namely DL-βaminoisobutyrate, L-alanine, citrate, creatinine, DL-ornithine, DL-valine, formate, and L-glutamine, all of which had been purchased from Sigma-Aldrich Chemie GmbH, Schnelldorf, Germany, were added to 32 aliquots of the urine specimen at concentrations ranging from 0.05 to 6.25 mmol/L in multiples of 2, resulting in eight different spike-in levels (Table 1). Added concentrations of individual metabolites were selected at random from these levels with the restraint that each metabolite and spike-in level, respectively, was added only once to each aliquot, thereby keeping the total concentration of added metabolites consistently at 12.45 mmol/L per sample. To systematically investigate the influence of varying amounts of confounder on normalization, D-glucose was added at final concentrations ranging from 1.563 to 200 mmol/L in multiples of two. This yielded eight different glucose levels with four samples each containing the same glucose level. For this, 10 mL of stock solutions were prepared for each compound by carefully weighing in the required amount of material to reach concentrations of 40 mmol/L each, with the exception of D-glucose, for which a 1 mol/L stock solution was prepared. The accuracy of the used scale was ±0.1 mg. To reach in each sample the desired concentrations and a constant volume, appropriate volumes of the stock solutions and pure water were added to the urine matrix. In addition, the spike-in dataset contained a blank sample containing the urine background only. It was used to determine the endogenous levels of the spiked-in compounds. These levels were subtracted before the calculation of fold changes. The second dataset comprised

1

H NMR fingerprints of n = 85 EDTA-plasma

specimens subjected to 10 kDa cut-off ultra filtration, which had been drawn from patients 24 h after cardiac surgery with cardiopulmonary bypass use at the University Clinic of Erlangen upon receipt of informed consent.21 Of the 85 patients, 33 had ACS Paragon Plus Environment

5

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

developed acute kidney injury (AKI) after surgery.

NMR spectroscopy To each 400-µL specimen of the Latin-square design and the AKI sample set, 200 µL of 0.1 mol/L phosphate buffer, pH 7.4, and 50 µL of deuterium oxide containing 0.75% (w/v) trimethylsilyl-2,2,3,3-tetradeuteropropionic acid (TSP) as a reference [Sigma-Aldrich] were added. All NMR experiments were performed employing NMR tubes with a diameter of 5 mm. 1D 22,23

previously

1

H spectra were measured as described

on a 600 MHz Bruker Avance III spectrometer [Bruker BioSpin GmbH,

Rheinstetten, Germany], which was equipped with a cryogenic probe with z-gradients and a cooled automatic sample changer. A 1D nuclear Overhauser enhancement spectroscopy (NOESY) pulse sequence was used in all cases and solvent signal suppression was achieved by presaturation during relaxation and mixing time [RD = 4.0 s; tm = 0.01 s]. The full line width at half height of the TSP reference signal amounted in all spectra of both the Latin-square and the AKI dataset to values between 0.7 Hz and 1.0 Hz. To simulate variations in overall metabolite concentration for the Latin-square design, spectra were acquired with different numbers of scans, which directly influence the obtained spectral intensities. Division of the number of scans by four corresponds to a twofold sample dilution. This was experimentally verified by comparing two samples of the Latin-square design that had been diluted by a factor of two with corresponding samples, for which the number of scans had been reduced by a factor of four. These comparisons yielded an average difference of 3.9% in signal-to-noise ratio indicating that dilution effects may be safely simulated by an appropriate reduction of the number of scans. As described above, the Latin-square dataset consists of 32 spectra assigned in equal proportion (n=4) to 8 different glucose levels. Each of the four spectra per glucose level was acquired with a different number of scans, namely 16, 32, 64, and 128 scans, respectively (Table 1). For the AKI dataset, each spectrum had been acquired with a constant number of 128 scans. For each dataset, all spectra were measured once, Fourier transformed, and phase corrected by automated routines. A flat baseline was obtained employing the baseopt option of TopSpin3.1 [Bruker BioSpin]. All spectra were chemical shift referenced relative to the TSP signal. Figure 1A shows a section of an exemplary spectrum of the AKI dataset. As EDTA-plasma samples were employed the amount of bivalent cations such as Mg2+ and Ca2+ could

ACS Paragon Plus Environment

6

Page 6 of 35

Page 7 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

be determined based on the signals of the corresponding EDTA complexes. In Figure 1B a spectrum of the Latin-square spike-in design is shown, with all spiked-in metabolites denoted. For data analysis in the R (vs. 3.1.0) statistical computing environment,24 bin (feature) tables were generated from the 1D 1H NMR spectra using AMIX 3.9.13 [Bruker BioSpin]. Signal positions between samples may shift due to slight changes in pH, salt concentration, and/or temperature. Here, we chose to apply equidistant binning to compensate for such shifts. Employing an optimized bin size of 0.01 ppm and taking the absolute value of bin integrals only, bins were generated in the region from 9.5 - 0.5 ppm. The optimized bin size of 0.01 ppm had been determined in previous investigations on sample classification25,26 using sets of both plasma and urine specimens by systematically varying the bin size until optimal cross-validated classification performance was obtained. For the Latin-square design, excluding the broad urea signal from 6.0 – 5.5 ppm and the solvent area from 5.0 – 4.7 ppm, 872 bins were obtained per spectrum. It was manually verified, that the regions surrounding the excluded solvent signal were not affected by baseline distortions. No normalization of the data was performed at this point. This resulted in a data matrix X = (xij) with rows i = 1I and I = 872 representing the feature or bin number, and columns j = 1J with J = 36 representing the number of acquired datasets. For the AKI dataset, a total of 718 spectral bins, excluding the spectral regions containing the water as well as the broad urea signal, the glycerol (filter preservative) and free EDTA signals, were obtained for each of the 85 spectra. Prior to subsequent normalization, signals of each spectrum were scaled to the integral of the reference TSP signal to correct for variations in spectrometer performance over time. This is especially important for large studies that require weeks to months of NMR data acquisition. However, it is important that the error introduced by the addition of the reference substance does not exceed variation in spectrometer performance. This we investigated in detail by preparing ten 400-µL aliquots of urine, to which we added 250 µL each of phosphate-buffer containing 50 µL deuterium oxide with 0.75% (w/v) TSP. All aliquots were measured 10 times each over a period of 2 weeks, yielding a total of 100 1D 1H spectra. The average relative standard deviation of the pipetting error calculated from the variation in signal intensity of the TSP reference signal between aliquots was 0.8%, while that of the spectrometer performance, defined as variation in TSP reference signal intensity within each aliquot across measurements,

ACS Paragon Plus Environment

7

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

amounted to 3.7%. Hence, it was concluded that normalization relative to the TSP signal indeed allowed to account for variation in spectrometer performance. To verify the precision of the preparation of the samples of the Latin-square design, we additionally acquired all samples with a uniform scan number of 128. The data was scaled with respect to the TSP reference signal to correct for differences in spectrometer performance. Next the blank spectrum containing the urine background only was subtracted from all other spectra and fold changes were determined for each spike-in. Ideally in case of no pipetting and measurement errors, there should be no difference between expected and observed fold changes. Across all glucose levels we obtained for all metabolites a median difference between expected and observed fold changes of 0.004 with a corresponding small inter-quartile range of 0.099.

Calculation of fold changes For the determination of fold changes we manually identified for each metabolite one non-overlapping signal. In case that a signal was split in two bins or that slight variations in signal positions due to for example pH differences were observed all corresponding bins were taken into account. The latter case was especially observed for citrate.

Basic characteristics of the normalization algorithms employed All normalization methods employed assume that NMR signal intensities scale linearly with metabolite concentration and are mostly independent of the chemical properties of the investigated molecules. The methods evaluated can be grouped in two groups. The first group aims at reducing between-sample variation, while the second group aims at adjusting the variance of the different features to reduce heteroscedasticity in the data. We have included methods from both groups in our evaluation. However, the focus is clearly on methods removing unwanted sample-tosample variation. The evaluated normalization approaches have been described previously. 9 Therefore, only a very short description of each method is given. Probabilistic Quotient Normalization7 assumes that biologically interesting concentration changes influence only parts of the NMR spectrum, while dilution effects will affect all metabolite signals. For each variable of interest the quotient of a

ACS Paragon Plus Environment

8

Page 8 of 35

Page 9 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

given test and reference spectrum is calculated and the median of all quotients is estimated. Finally, all variables of the test spectrum are divided by this median quotient. Cyclic Locally Weighted Regression (Cyclic Loess)8,9 is based on logged BlandAltman or MA-plots.27 The presence of nonlinear biases is assumed. Contrast Normalization10 also uses MA-plots and makes the same assumptions as Cyclic Loess. The data matrix of the input space is logged and transformed by means of an orthonormal transformation matrix T=(tij) into a contrast space. This expands the idea of MA-plots to several dimensions and converts the data into a set of rows representing orthonormal contrasts. A set of normalizing curves is then fitted similar to Cyclic Loess Normalization. Finally, data are mapped back into normal input space. The idea of Quantile Normalization4 is to bring all spectra to an identical distribution of intensities across features (bins).

Baseline Normalization. In contrast to normalizing the data to a measure of the full dataset, the data is normalized only to a subset of it, the so-called baseline. For example, the baseline may be constructed by calculating the median of each feature over all spectra.

Linear Baseline Normalization uses a scaling factor to map linearly from each spectrum to the baseline.4 Therefore, one assumes a constant linear relationship between each feature of a given spectrum and the baseline. In the version implemented by Kohl et al., 2012,28 the scaling factor was computed for each spectrum as the ratio of the mean intensity of the baseline to the mean intensity of the spectrum. To achieve greater independence from large confounding factors, we also calculated scaling factors on the basis of median intensities of the spectra.

A non-linear version of baseline normalization that also contains a feature selection step was described by Li and Wong.11 In this procedure, the feature intensities of a baseline spectrum having the median overall intensity are plotted against the feature

ACS Paragon Plus Environment

9

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

intensities of the spectrum to be normalized. Next, features of similar rank in both spectra are identified and a smoothing spline, which is subsequently used as the normalization curve, is fitted through these features. After normalization, the data should ideally align along the diagonal of the scatter plot.

The next method, which also assumes non-linear dependencies between individual features and the baseline, fits Cubic Splines between each spectrum and the baseline.12 In contrast to the original publication, the geometric mean used for computation of the baseline was replaced with the arithmetic mean for reasons of robustness to negative values.

The second group comprises methods used to adjust the variance of each spectral feature. The simplest of these approaches is called Auto Scaling or unit variance (uv) scaling.13 It results in every feature displaying a standard deviation of one, i.e. the data is transformed to standard units. Briefly, data is first centered by subtracting from each feature its mean value across spectra. From this centered data the standard deviation of each feature is obtained and all data points for this feature are divided by this scaling factor. Closely related is Pareto Scaling,14 which uses the square root of the standard deviation for scaling. Its scaling effect is less drastic so that the data stays closer to its original values.

Finally, the Variance Stabilization Normalization (VSN) R-package used here applies a combination of methods that first correct for between-sample variations by linear mapping of all spectra to the first spectrum and then adjust the variance of the data by means of an inverse hyperbolic sine.15

Detection of unbalanced regulation Analysis of intra- and inter-group inhomogeneities For the detection of unbalanced up-regulation of one or several compounds (see upper left part of Figure 2), we calculate first the total spectral area for each NMR spectrum, excluding areas of the solvent signal and, in case of urine, the broad urea signal. Then, we test the total spectral areas of the set of spectra of interest for

ACS Paragon Plus Environment

10

Page 10 of 35

Page 11 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

normal distribution using the Shapiro-Wilk test. Here, we assume that the total spectral areas are normally distributed around a fixed mean in the absence of unbalanced regulation, irrespective of the experimental groups. To detect single outliers within the investigated groups that may not be detected by the Shapiro-WilkTest, total spectral areas are also plotted in a histogram representation.

Decision making Only if the Shapiro-Wilk-Test indicates the absence of unbalanced regulation (p > 0.05), one may proceed with normalization procedures including spectral mapping such as Quantile Normalization as proposed by Kohl et al. 2012.28 In all other instances, a different strategy shall be employed.

Selection of reference features for data normalization in the presence of unbalanced regulation Following an approach previously proposed for the analysis of microarray data,17 a method is suggested for the identification of non-regulated features that serve as reference features for subsequent data normalization. We assume, that the variance of regulated features across all specimens is larger than that of non-regulated features. By employing the variance as filter criterion, features that exhibit a large variance across spectra will not serve as reference features. Note that this selection only affects computation of the parameters of the normalization procedure and that all features including reference and non-reference features will be ultimately normalized. For the identification of suitable reference features, all features are first ordered in descending order of their variance across samples (steps A and B in Figure 2). Next, the list of features serving as reference features is iteratively reduced from top to bottom until convergence of the normalization parameters is achieved. For each reference feature in a given spectrum, an individual scaling factor is calculated between the reference feature and the corresponding feature in the baseline (step C in Figure 2). Then, the standard deviation of all individual scaling factors is computed for each sample (step D in Figure 2). In case that the set of reference features consists mostly of non-regulated features, the standard deviation of the individual scaling factors of a given sample will approach a minimal value. It is assumed that technical biases or non-induced biological variations, as for example differences in general urine concentration due to differences in fluid intake, affect all

ACS Paragon Plus Environment

11

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

features in a similar fashion. To obtain a single value for each dataset, the mean of the sample-wise standard deviations (mswsd) is calculated for a given amount of selected reference features. To get an estimate of the variability of the so calculated mswsd values we employed a resampling approach where a subset containing 66% of the samples of the original data matrix is randomly selected and the mswsd value is computed for this subset. For a given amount of selected reference features 100 iterations are performed. Furthermore, at each iteration step the total spectral areas of the selected normalization reference features are computed and analyzed for the presence of unbalanced regulation by means of the Shapiro-Wilk normality test (step E in Figure 2). Next, features are iteratively removed from the list of reference features (step F in Figure 2) and new mswsds are computed until only 10% of the original features are used as reference features. No further reduction is sought, as the number of features remaining becomes insufficient for a stable computation of the normalization parameters (data not shown). To select an optimal setting for the amount of used reference features, the corresponding mswsd values are plotted and by manual inspection of the resulting curve the point is selected where the mswsd approaches a nearly constant minimal value and its variation does not decrease further (step G in Figure 2). This setting is additionally verified by analyzing the corresponding result of the Shapiro-Wilk normality test as at this point no unbalanced regulation should be present among the features selected as normalization reference features. Ideally, this process will remove all features from the list of reference features, which are affected by metabolite regulation, including unbalanced regulation.

Results Detection of intra- and inter-group inhomogeneities Application of the Shapiro-Wilk-Test to the AKI dataset yielded a p-value of 1.6 * 10-4. This indicated the presence of significant inter- and/or intra-group inhomogeneities in total spectral area, that were caused in part by the significantly (p=0.03) higher glucose levels in the AKI group (9.72±2.73 mmol/L) than in the non-AKI group (8.38±2.83 mmol/L). As a consequence, VSN preprocessing yielded an erroneously significant (p=1.2x10-6) difference in the abundance of CaEDTA2- between the AKI and the non-AKI-group as revealed by subsequent targeted quantitative analysis of CaEDTA2- (Figure 3, VSN all). In contrast, simple normalization of spectral features ACS Paragon Plus Environment

12

Page 12 of 35

Page 13 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

to the TSP reference signal followed by log2-transformation, confirmed for CaEDTA2the absence of a significant intergroup difference (p = 0.67), but instead revealed MgEDTA2-, which had not been among the discriminating features upon VSN, to be highly discriminative with a p-value of 1.9 * 10-6 (Figure 3, TSP + log2). This finding could also be confirmed by targeted quantitative analysis. Note that careful manual inspection of the spectra revealed that the bins corresponding to CaEDTA2- and MgEDTA2- at 2.565 ppm and 2.705 ppm, respectively, do not contain contributions from citrate, although the small citrate signals are in close proximity. Simple normalization to the TSP signal worked well in this case, although it provided only a correction for differences in spectrometer performance and no adjustment for noninduced biological variances and technical biases not related to spectrometer performance. This may not always be the case and, thus, appropriate normalization may still be required to identify discriminating features.

Analysis of normalization methods in the presence of confounding factors As already discussed in the introduction, improper data normalization caused by unbalanced regulation may lead to an erroneous determination of fold changes. Therefore, the performance of different normalization methods in the presence of increasing amounts of unbalanced regulation with respect to the correct determination of fold changes was systematically investigated employing the quasi Latin-square design. Here, expected fold changes were known by experimental design, allowing detailed comparisons between measured and expected fold changes. We were particularly interested in the degree, to which large and small inter-group differences in glucose levels might impact data normalization. Figures 4 and S1 show the deviations between expected and observed fold changes in the presence of decreasing inter-group differences in glucose levels (G8-G1, 198.501 mmol/L; G7-G2, 96.906 mmol/L; G6-G3, 43.764 mmol/L; G5-G4, 12,504 mmol/L) for the different data normalization methods tested in the form of boxplot diagrams. For each boxplot, two groups of samples, each consisting of 4 samples, are compared resulting in 16 possible combinations. As in each sample the amounts of 8 different metabolites were systematically varied, a total of 128 fold changes are analyzed within each boxplot. The most upper left section of Figure 4 depicts the results obtained for non-normalized data. Note that each of the eight groups G1 to G8 contains four different simulated dilution levels each. As a consequence, large

ACS Paragon Plus Environment

13

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

positive or negative deviations between expected and observed fold changes will be observed without normalization as indicated by the corresponding inter-quartile ranges, which amounted in all cases to about 2.0. However, the median deviation between expected and observed fold changes will be close to zero. With proper normalization the inter-quartile ranges of the boxes should considerably decrease, while keeping the median deviations close to zero. In summary, a small distance between the lower and upper quartile indicates that biases related to the different simulated dilutions were successfully reduced within each group, whereas a small deviation of the median value from zero indicates the absence of systematic differences between groups. A method that worked well was Linear Baseline Normalization based on median values (Fig. 4); it successfully reduced both within- and between-groups caused by the different dilution and glucose levels, respectively. The comparisons G5-G4, G6G3, G7-G2, and G8-G1 yielded inter-quartile ranges of 0.82, 0.80, 0.70, and 0.62, and median values of -0.07, -0.25, -0.50, and -0.60, respectively. Li-Wong Normalization, in contrast, performed the worst, yielding large inter-quartile ranges of up to 5.5 (Fig. S1). All other tested normalization approaches performed between these two extreme cases. For example, Quantile Normalization, Cubic Spline Normalization, Probabilistic Quotient Normalization, and Variance Stabilization Normalization allowed a reliable reproduction of fold changes for small degrees of unbalanced regulation. A method that also performed well in the presence of larger amounts of unbalanced regulation was probabilistic quotient normalization, yielding for the comparison G8-G1 an inter-quartile range and a median value of 0.82 and 0.86, respectively (Fig. 4). Next, the AKI dataset was analyzed. The presence of AKI leads to a reduction in glomerular filtration rate and a concomitant increase in serum metabolite levels excreted primarily into urine (manuscript in preparation). As a consequence, highly unbalanced metabolite regulation is observed between AKI and non-AKI patients as well as between AKI patients depending on the degree of loss of kidney function. As a test case for proper data normalization we chose the Ca-EDTA2- and Mg-EDTA2signals. Calcium serum levels are tightly controlled. Even in the presence of AKI, they should vary only slightly as maintenance of calcium levels in intra- and extracellular fluids is critical for functions such as neuromuscular activity, hormone release and action, enzyme regulation, and others.29 In contrast, for magnesium larger

ACS Paragon Plus Environment

14

Page 14 of 35

Page 15 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

differences may be observed. This is also reflected by targeted quantitative analysis, which shows for the comparison between the AKI and the non-AKI groups p-values of 0.49 and 1.0 * 10-5 for calcium and magnesium, respectively. Employing the AKI fingerprinting dataset, which consists of a matrix of integrated intensity values in combination with normalization relative to the reference TSP signal followed by log2 transformation, gave comparable p-values of 0.67 and 1.9 * 10-6 for calcium and magnesium, respectively. However, normalization relative to TSP corrects only for variations in spectrometer performance, other technical biases or non-induced biological variations are not reduced. In case of VSN, markedly different p-values of 1.2 * 10-6 and 0.79 were obtained for calcium and magnesium, respectively, which is highly improbable from a biological point of view. Next, we applied Linear Baseline Normalization based on medians, which allowed for the Latin-square data a reliable reproduction of fold changes. However, even this method failed to reproduce the results obtained by the targeted analysis. This can be explained by the fact that in contrast to the Latin-square data where only glucose was present as a confounding factor, unbalanced regulation is caused by many different metabolites for the AKI data. A method that was not among the best performing methods for the Latin-square design was the approach proposed by Li and Wong11 which contains an inherent feature selection step. However, for the AKI dataset it performed quite well yielding pvalues of 0.84 and 1.2 *10-6 for calcium and magnesium, respectively. In light of these results and the fact that unbalanced regulation is caused by a smaller or larger subset of all features, we devised an approach where determination of the parameters of the respective normalization approach is based on a subset of the data, which ideally contains only non-regulated features. Employing the parameters obtained on this smaller subset termed the normalization reference features all data are subsequently normalized. One way to automatically determine the normalization reference features involves analyzing the variance of the individual features across samples, followed by exclusion of the most variable ones. One parameter that has to be individually calibrated for each dataset is the selection of reference features. For this, we devised a strategy where increasing amounts of features were iteratively excluded based on their degree of variance across samples until so-called mswsd values and their variation converge to a nearly constant value. As detailed in the materials and methods section, mswsd values were computed by calculating the mean of the sample-wise standard deviations (mswsd) of individual feature wise

ACS Paragon Plus Environment

15

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

scaling factors. To assess the stability of the mswsd values a resampling approach was employed where repeatedly a subset containing 66% of the samples of the original data matrix is randomly selected and the mswsd value is computed for this subset. For the Latin-square dataset, the decrease in mswsd values in relation to the amount of features used is shown in Figure 5A. The results from the resampling approach are displayed as boxplots, where the thick black line within each box indicates the median of the individual mswsd values and the inter-quartile range gives their variability. Reduction of the normalization reference features used from 95% to 90%, thereby excluding the most variable features in the dataset, led to a substantial drop in median mswsd from 6.52 to 3.21 with a concomitant decrease in the inter-quartile range from 2.03 to 0.40. Further reductions in the amount of normalization reference features used yielded increasingly smaller decreases in median mswsd values and their variability. Using 20% of the least variant features reduced the mswsd and the inter-quartile range to 1.69 and 0.07, respectively, while the corresponding values for 10% remained almost unchanged at 1.66 and 0.06. In addition, the total spectral areas of the normalization reference features were computed and analyzed for the presence of unbalanced regulation by means of the Shapiro-Wilk normality test (Table 2). Results indicate for 10%, 15% and 20% of normalization reference features the absence of unbalanced regulation. Therefore, 20% of the least variable features were chosen as normalization reference features. Of the previous data normalization approaches, Linear Baseline Normalization based on either mean or median values, Probabilistic Quotient Normalization, Variance Stabilization Normalization, Cyclic Loess Normalization, Cubic Spline Normalization, and Contrast Normalization were combined with prior selection of normalization reference features. Li-Wong Normalization, which already contains an inherent feature selection step, was excluded here. The aim of Quantile Normalization is to achieve for all features the same intensity distribution across samples. By prior exclusion of some of the features this could not be achieved for the excluded features, therefore, Quantile Normalization was not combined with prior feature selection. Further excluded were Auto and Pareto Scaling, as both methods treat each feature independently and, therefore, prior selection of normalization reference features would lead to loss of all other data points. As can be seen from Figures 6 and S2 for the Latin-square data, within-group biases and biases due to unbalanced regulation were successfully reduced for all group comparisons by the two linear

ACS Paragon Plus Environment

16

Page 16 of 35

Page 17 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

baseline normalization methods based on mean and median values, respectively, as well as by Probabilistic Quotient Normalization and Variance Stabilization Normalization. These four methods performed almost equal with a very slight advantage for Variance Stabilization Normalization (Table 3). Here, the group comparisons G8-G1, G7-G2, G6-G3, and G5-G4 yielded for the comparison between expected and determined fold changes inter-quartile ranges of 0.91, 0.93, 1.09, and 1.27, and median values of -0.06, 0.26, 0.02, and -0.07, respectively. We further analyzed the stability of the obtained results with respect to selecting the optimal amount of normalization reference features. As inspection of Figure 5A showed that in the region between 90% and 10% of used normalization reference features only a comparably small reduction of mswsd values and their variability was observed, calculations for Variance Stabilization Normalization were repeated employing 90% and 10% of the least variable features as normalization reference features. Employing 90% of features for normalization the group contrasts G8-G1, G7-G2, G6G3, and G5-G4 were analyzed and for the comparison between expected and determined fold changes inter-quartile ranges of 0.56, 0.71, 0.80, and 0.80, and median values of -0.78, -0.53, -0.44, and 0.05 were obtained, respectively. Employing 10% of the data as reference features we obtained for the four group contrasts G8-G1, G7-G2, G6-G3, and G5-G4 inter-quartile ranges of 1.12, 1.21, 1.23, and 1.57 and median values of 0.09, 0.20, 0.11, and 0.06, respectively. Comparing these results with those obtained for 20% of normalization reference features indicates that for the Latin-square dataset the exact amount of used normalization reference features is not critical as long as the median mswsd values and their corresponding variability do not increase substantially compared to their minimum values. Further analysis of the excluded features showed that the confounding factor glucose was successfully eliminated from the set of reference features. To select for the AKI dataset the optimal set of normalization reference features, mswsd values (Figure 5B) were analyzed as described above. Reduction of reference features from 100% to 30% resulted in a relatively smooth decline of median mswsd values from 2.53 to 1.44 and a reduction of box widths from 0.23 to 0.07. Further, reduction of reference features to 20% and 10% resulted in a constant mswsd value of 1.48 and a further decline of box widths to 0.02 and 0.03, respectively. Furthermore, the total spectral areas of the normalization reference features were computed and analyzed for the presence of unbalanced regulation

ACS Paragon Plus Environment

17

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(Table 2). Results indicate for 10%, 15% and 20% of normalization reference features no unbalanced regulation. Therefore, the 20% least variable features were selected as reference features. Application of prior selection of normalization reference features in combination with Linear Baseline Normalization based on either mean or median values, Probabilistic Quotient Normalization, and Variance Stabilization Normalization, respectively, to the AKI data yielded upon log2 transformation for calcium insignificant p-values of 0.37, 0.72, 0.59, and 0.65, respectively, while for magnesium p-values of 6.4 * 10-6, 5.7 * 10-6, 4.5 * 10-6, and 6.9 * 10-6 were obtained comparing patients that developed AKI with those, who did not. Note that for Variance Stabilization Normalization no log2 transform was performed as this method already transforms the data for variance stabilization. These data are in line with the results obtained from targeted analyses. In the bottom panel of Figure 3 the feature intensities for CaEDTA2- and MgEDTA2following Variance Stabilization Normalization are shown as boxplots. Similar to the results obtained for the Latin-square data, prior selection of normalization reference features did not improve significantly the results for Cyclic Loess Normalization, Cubic Spline Normalization, and Contrast Normalization. As for the Latin-square dataset we further analyzed the stability of the obtained results with respect to selecting the optimal amount of normalization reference features. Inspection of Figure 5B revealed only relatively slight variation in mswsd values and their variability in the region between 20% and 10% of normalization reference features used. Therefore, calculations for the above four top performing methods were repeated employing 15% and 10% of the least variable features for normalization. Employing 15% of features and application of Linear Baseline Normalization based on either mean or median values, Probabilistic Quotient Normalization, and Variance Stabilization Normalization resulted for calcium in insignificant p-values of 0.91, 0.66, 0.81, and 0.82, respectively, while for magnesium p-values of 1.47 * 10-6, 9.10 * 10-7, 1.40 * 106

, and 9.63 * 10-6 were obtained. Further reduction of the amount of normalization

reference features to 10% yielded in the same order as above for calcium p-values of 0.37, 0.44, 0.28, and 0.54, respectively, while for magnesium p-values of 2.83 * 10-7, 5.18 * 10-6, 3.14 * 10-7, and 2.65 * 10-6 were obtained. Comparing the results obtained for 20%, 15% and 10% of normalization reference features shows only minor differences, therefore all three values present for the AKI dataset suitable choices for the amount of employed normalization reference features.

ACS Paragon Plus Environment

18

Page 18 of 35

Page 19 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

In summary, results show for both datasets investigated that the exact amount of used normalization reference features is not critical as long as the median mswsd values and their corresponding variability do not increase substantially compared to their minimum values and that the Shapiro-Wilk-Test indicates the absence of unbalanced regulation for the normalization reference features.

Discussion Results for the Latin-square dataset indicate, that as long as differences in glucose levels are sufficiently small, Quantile, Probabilistic Quotient, Cubic Spline, and Variance Stabilization Normalization, respectively, will not distort the data. Linear Baseline Normalization based on medians also worked well. This is in good agreement with our previous publication that had shown Quantile, Cubic Splines, and Variance Stabilization Normalization to work well in the absence of unbalanced regulation.28 Linear Baseline Normalization based on medians was not included in our previous study. Inspection of results showed for most of the tested methods that increasing amounts of unbalanced regulation strongly affected data normalization. The method that worked well even in the presence of relatively large differences of confounding factors was Linear Baseline Normalization based on medians. The different results obtained by Linear Baseline Normalization based either on mean or median values can be explained by the fact that a baseline calculated based on median values is more robust against differences in confounding factors than a baseline based on mean values where extreme outliers will bias the estimation of the mean. However, data also showed that removal of within group biases, as indicated by the box width, is more effective when the Linear Baseline Normalization is based on mean instead of median values. Nevertheless, when applied to the AKI data, where unbalanced regulation is due to a large number of features exclusively upregulated in the group of AKI patients, all of the tested methods failed. As an alternative solution, we tested whether prior selection of a suitable subset of normalization reference features yielded substantially improved results. As explained above, the variance of the individual features was used as the filter criterion. Results showed that both Linear Baseline Normalization methods, Probabilistic Quotient Normalization, and Variance Stabilization Normalization based on normalization reference features improved results for both the Latin-square dataset and, even more so, for the AKI dataset in comparison to no selection of normalization reference ACS Paragon Plus Environment

19

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

features. Therefore, the application of variance based feature filtering prior to data normalization appears to be a suitable option in the presence of unbalanced regulation of metabolites. In this context, it is interesting to note the failure of the normalization method by Li and Wong, which already contains an inherent feature selection step, for the Latin-square design. The likely reason is that the selection of invariant features is based on rank, i.e. features in spectra are ranked according to their intensity and only features of similar rank are then used to fit the normalization curve. Since glucose was the most abundant metabolite in the Latin-square design, its spectral features always ranked identically across samples and, thus, were used to fit the normalization curve, thereby distorting the signal intensities of the spiked-in metabolites upon normalization. In contrast, for the AKI dataset, where unbalanced regulation is due to a large number of different metabolites, the method worked quite well.

Conclusions In summary, the use of an appropriate data normalization procedure is of prime importance for subsequent statistical data analysis as exemplified here for the correct detection of fold changes. Detection of unbalanced regulation is accomplished by the Shapiro-Wilk-Test. We recommend using this tool in combination with a manual inspection of the data before reaching a final conclusion. Based on the results of these pre-tests, a suitable data normalization method may be selected. In case of no or only a small degree of unbalanced regulation, Quantile, Probabilistic Quotient, Cubic Spline, Linear Baseline Normalization based on medians, and Variance Stabilization Normalization present suitable choices. When a larger degree of unbalanced regulation is observed, reference feature selection in combination with Variance Stabilization Normalization, Linear Baseline Normalization based on means or medians, or Probabilistic Quotient Normalization may be used.

Supporting Information This material is available free of charge via http://pubs.acs.org/ Figure S1 - Deviations between expected and observed fold changes Figure S2 - Deviations between expected and observed fold changes after exclusion of high variance signals

ACS Paragon Plus Environment

20

Page 20 of 35

Page 21 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Software availability The R-code for the selection of normalization reference features in combination with the tested normalization approaches is available from the authors upon request or from “https://genomics.uni-regensburg.de/site/institute/software”.

Author contributions JH, WG, FT, JE, RS, and PO devised the method, WG and JH implemented and tested it, HZ and CS performed experiments, WG, JH, FT, RS and PO wrote the manuscript.

Acknowledgements The authors thank Drs. Gunnar Schley, Carsten Willam, and Kai-Uwe Eckardt for providing the plasma specimens of the AKI dataset. The authors thank Trixi von Schlippenbach for critical reading of the manuscript. Financial support was provided by the German Federal Ministry of Education and Research (Grant no. 01 ER 0821 and Grant no. 0316166G) and the German Research Foundation (KFO 262).

Competing interests The authors declare no competing interests.

ACS Paragon Plus Environment

21

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

References

(1) Wishart, D. S. Computational Approaches to Metabolomics, Methods Mol.Biol. 2010, 593, 283– 313. (2) Holmes, E.; Foxall, P. J. D.; Spraul, M.; Farrant, R. D.; Nicholson, J. K.; Lindon, J. C. 750 MHz 1H NMR Spectroscopy Characterisation of the Complex Metabolic Pattern of Urine from Patients with Inborn Errors of Metabolism: 2-hydroxyglutaric Aciduria and Maple Syrup Urine Disease, J.Pharm.Biomed.Anal. 1997, 15, 1647–1659. (3) Klein, M. S.; Buttchereit, N.; Miemczyk, S. P.; Immervoll, A. K.; Louis, C.; Wiedemann, S.; Junge, W.; Thaller, G.; Oefner, P. J.; Gronwald, W. NMR metabolomic analysis of dairy cows reveals milk glycerophosphocholine to phosphocholine ratio as prognostic biomarker for risk of ketosis, J.Proteome Res. 2012, 11, 1373–1381. (4) Bolstad, B. M.; Irizarry, R. A.; Astrand, M.; Speed, T. P. A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias, Bioinformatics. 2003, 19, 185–193. (5) Sysi-Aho, M.; Katajamaa, M.; Yetukuri, L.; Oresic, M. Normalization method for metabolomics data using optimal selection of multiple internal standards, BMC Bioinformatics. 2007, 8, 93. (6) Bellomo, R.; Ronco, C.; Kellum, J. A.; Mehta, R. L.; Palevsky, P. Acute renal failure - definition, outcome measures, animal models, fluid therapy and information technology needs: the Second International Consensus Conference of the Acute Dialysis Quality Initiative (ADQI) Group, Crit Care. 2004, 8, R204. (7) Dieterle, F.; Ross, A.; Schlotterbeck, G.; Senn, H. Probabilistic Quotient Normalization as Robust Method to Account for Dillution of Complex Biological Mixtures. Application to 1H NMR Metabolomics, Anal.Chem. 2006, 78, 4281–4290. (8) Cleveland, W. S.; Devlin, S. J. Locally Weighted Regression - an Approach to Regression-Analysis by Local Fitting, J.Am.Stat.Assoc. 1988, 83, 596–610. (9) Dudoit, S.; Yang, Y. H.; Callow, M. J.; Speed, T. P. Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments, Stat.Sinica. 2002, 12, 111–139. (10) Astrand, M. Contrast normalization of oligonucleotide arrays, J Comput Biol. 2003, 10, 95–102. (11) Li, C.; Wong, W. H. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection, Proc Natl Acad Sci U S A. 2001, 98, 31–36. (12) Workman, C.; Jensen, L. J.; Jarmer, H.; Berka, R.; Gautier, L.; Nielser, H. B.; Saxild, H. H.; Nielsen, C.; Brunak, S.; Knudsen, S. A New Non-Linear Normalization Method for Reducing Variability in DNA Microarray Experiments, Genome Biol. 2002, 3, research0048. (13) Jackson, J. E. A User's Guide to Principal Components; Wiley-Interscience: Hoboken, N.J, 2003. (14) Eriksson, L.; Antti, H.; Gottfries, J.; Holmes, E.; Johansson, E.; Lindgren, F.; Long, I.; Lundstedt, T.; Trygg, J.; Wold, S. Using Chemometrics for Navigating in the Large Data Sets of Genomics, Proteomics, and Metabonomics (gpm), Anal.Bioanal.Chem. 2004, 380, 419–429. (15) Huber, W.; Heydebreck, A. V.; Sültmann, H.; Poustka, A.; Vingron, M. Variance Stabilisation Applied to Microarray Data Calibration and to the Quantification of Differential Expression, Bioinformatics. 2002, 18, S96. ACS Paragon Plus Environment

22

Page 22 of 35

Page 23 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

(16) Jauhiainen, A.; Madhu, B.; Narita, M.; Narita, M.; Griffiths, J.; Tavaré, S. Normalization of metabolomics data with applications to correlation maps, Bioinformatics. 2014, 30, 2155–2161. (17) Pelz, C. R.; Kulesz-Martin, M.; Bagby, G.; Sears, R. C. Global rank-invariant set normalization (GRSN) to reduce systematic distortions in microarray data, BMC Bioinformatics. 2008, 9, 520. (18) De Livera, Alysha M; Dias, D. A.; Souza, D. de; Rupasinghe, T.; Pyke, J.; Tull, D.; Roessner, U.; McConville, M.; Speed, T. P. Normalizing and integrating metabolomics data, Analytical chemistry. 2012, 84, 10768–10776. (19) De Livera, Alysha M; Sysi-Aho, M.; Jacob, L.; Gagnon-Bartsch, J. A.; Castillo, S.; Simpson, J. A.; Speed, T. P. Statistical methods for handling unwanted variation in metabolomics data, Analytical chemistry. 2015. (20) Laywine, C. F.; Mullen, G. L. Discrete Mathematics using Latin Squares; Wiley: New York, 1998. (21) Zacharias, H. U.; Schley, G.; Hochrein, J.; Klein, M. S.; Köberle, C.; Eckardt, K. U.; Willam, C.; Oefner, P. J.; Gronwald, W. Analysis of Human Urine Reveals Metabolic Changes Related to the Developmento of Acute Kidney Injury Following Cardiac Surgery, Metabolomics. 2013, 9, 697– 707. (22) Gronwald, W.; Klein, M. S.; Kaspar, H.; Fagerer, S.; Nürnberger, N.; Dettmer, K.; Bertsch, T.; Oefner, P. J. Urinary Metabolite Quantification Employing 2D NMR Spectroscopy, Anal.Chem. 2008, 80, 9288–9297. (23) Zacharias, H. U.; Hochrein, J.; Klein, M. S.; Samol, C.; Oefner, P. J.; Gronwald, W. Current Experimental, Bioinformatic and Statistical Methods used in NMR Based Metabolomics, Curr.Metabol. 2013, 1, 253–268. (24) Development Core Team, R. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2009. (25) Gronwald, W.; Klein, M. S.; Zeltner, R.; Schulze, B. D.; Reinhold, S. W.; Deutschmann, M.; Immervoll, A. K.; Böger, C. A.; Banas, B.; Eckardt, K. U.; Oefner, P. J. Detection of Autosomal Polycystic Kidney Disease Using NMR Spectroscopic Fingerprints of Urine, Kidney Int. 2011, 79, 1244–1253. (26) Hochrein, J.; Klein, M. S.; Zacharias, H. U.; Li, J.; Wijffels, G.; Schirra, H. J.; Spang, R.; Oefner, P. J.; Gronwald, W. Performance Evaluation of Algorithms for the Classification of Metabolic 1H-NMR Fingerprints, J.Proteome Res. 2012, 11, 6242–6251. (27) Altman, D. G.; Bland, J. M. Measurement in Medicine: The Analysis of Method Comparison Studies, J.Roy.Stat.Soc.D-Sta. 1983, 32, 307–317. (28) Kohl, S. M.; Klein, M. S.; Hochrein, J.; Oefner, P. J.; Spang, R.; Gronwald, W. State-of-the Art Data Normalization Methods Improve NMR-Based Metabolomic Analysis, Metabolomics. 2012, 8, 146–160. (29) Agus, Z. S.; Wasserstein, A.; Goldfarb, S. Disorders of calcium and magnesium homeostasis, Am. J. Med. 1982, 72, 473–488.

ACS Paragon Plus Environment

23

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 35

Tables Table 1. Quasi Latin-square design. Concentrations (mmol/L) of spike-ins and number of scans acquired are listed. DL-βAminoisobutyric acid

LAlanine

Citric acid

Creatinine

DLOrnithine

DLValine

Formic acid

LGlutamine

DGlucose

Scans

1

3.125

1.563

0.195

6.25

0.049

0.098

0.391

0.781

200

128

2

0.195

3.125

6.25

0.049

0.391

0.098

1.563

0.781

100

128

3

1.563

3.125

0.391

0.049

0.781

0.098

0.195

6.25

50

128

4

0.781

0.195

0.098

6.25

0.391

1.563

3.125

0.049

25

128

5

1.563

0.391

0.781

0.049

6.25

3.125

0.098

0.195

12.5

128

6

0.781

6.25

3.125

0.049

1.563

0.391

0.195

0.098

6.25

128

7

0.195

6.25

0.049

0.781

0.098

1.563

3.125

0.391

3.125

128

8

3.125

0.781

6.25

0.049

0.391

0.195

1.563

0.098

1.563

128

9

6.25

0.195

0.391

3.125

0.781

0.098

1.563

0.049

200

64

10

0.195

0.391

0.781

0.049

0.098

6.25

1.563

3.125

100

64

11

6.25

3.125

0.049

1.563

0.781

0.195

0.391

0.098

50

64

12

1.563

6.25

0.049

0.098

0.391

3.125

0.195

0.781

25

64

13

1.563

0.195

6.25

0.391

0.781

3.125

0.049

0.098

12.5

64

14

0.391

6.25

3.125

1,563

0.098

0.781

0.195

0.049

6.25

64

15

3.125

0.781

0.195

0.049

0.098

0.391

6.25

1.563

3.125

64

16

0.391

1.563

0.098

0.195

0.781

6.25

0.049

3.125

1.563

64

17

0.195

6.25

0.781

0.049

3.125

1.563

0.098

0.391

200

32

18

0.781

0.195

0.391

0.098

0.049

1.563

3,125

6.25

100

32

19

0.781

3.125

0.195

0.391

6.25

1.563

0.049

0.098

50

32

20

0.049

3.125

0.391

0.195

0.098

1.563

6.25

0.781

25

32

21

6.25

0.391

0.195

0.098

0.781

0.049

3.125

1.563

12.5

32

22

6.25

0.391

0.195

0.781

0.049

0.098

3.125

1.563

6.25

32

23

0.049

6.25

3.125

0.781

1.563

0.391

0.195

0.098

3.125

32

24

0.781

0.195

0.098

6.25

0.049

1,563

0.391

3.125

1.563

32

25

0.195

0.049

3.125

1.563

0.098

0.781

0.391

6.25

200

16

26

0.098

0.049

0.781

1.563

3.125

0.195

0.391

6.25

100

16

27

0.781

3.125

0.195

0.098

6.25

0.391

1.563

0.049

50

16

28

0.195

3.125

0.049

0.098

0.781

6.25

1.563

0.391

25

16

29

3.125

0.098

0.391

1,563

0.781

6.25

0.195

0.049

12.5

16

30

0.781

6.25

0.195

0.049

3.125

0.098

1.563

0.391

6.25

16

31

0.098

3.125

0.391

1.563

0.195

6.25

0.049

0.781

3.125

16

32

0.391

6.25

1.563

0.098

0.781

0.049

0.195

3.125

1.563

16

33

blank

blank

blank

blank

blank

blank

blank

blank

blank

128, 64, 32,16

Sample No.

ACS Paragon Plus Environment

24

Page 25 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Table 2. Results of the Shapiro-Wilk normality test for different amounts of normalization reference features. percentage of

Latin-square design

normalization reference

(p-value)

AKI (p-value)

features 10%

0.34

0.11

15%

0.08

0.26

20%

0.06

0.39

25%

9.3 * 10-3

4.7 * 10-4

30%

5.5 * 10-4

1.5 * 10-7

35%

7.6 * 10-6

3.2 * 10-11

40%

2.4 * 10-7

3.1 * 10-11

45%

1.3 * 10-6

1.6 * 10-9

50%

9.6 * 10-5

4.1 * 10-8

55%

1.8 * 10-3

6.2 * 10-8

60%

3.1 * 10-3

9.5 * 10-8

65%

1.3 * 10-2

1.8 * 10-7

70%

1.9 * 10-3

1.1 * 10-7

75%

9.9 * 10-5

1.1 * 10-7

80%

8.8 * 10-5

2.1 * 10-8

85%

5.7 * 10-6

8.3 * 10-8

90%

5.8 * 10-5

3.4 * 10-7

95%

6.5 * 10-4

4.2 * 10-5

100%

5.6 * 10-8

1.6 * 10-4

ACS Paragon Plus Environment

25

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 35

Table 3. Deviations between expected and observed fold changes as measured in terms of inter-quartile ranges and median deviations. Results are based on the Latinsquare dataset and are given for the four best performing normalization approaches

Normalization approach

Linear Baseline Normalization based on mean values Linear Baseline Normalization based on median values Probabilistic Quotient Normalization Variance Stabilization Normalization

inter-quartile ranges for group comparisons

median deviations for group comparisons

G8-G1 0.87

G7-G2 0.98

G6-G3 1.19

G5-G4 1.16

G8-G1 -0.14

G7-G2 0.35

G6-G3 0.02

G5-G4 -0.05

0.93

1.05

1.25

1.23

-0.07

0.41

0.13

-0.16

0.94

0.98

1.25

1.35

-0.11

0.37

-0.08

-0.06

0.91

0.93

1.09

1.27

-0.06

0.26

0.02

-0.07

ACS Paragon Plus Environment

26

Page 27 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure legends Figure 1. Exemplary 1D 1H spectra. A) A section of a plasma 1D 1H spectrum of an AKI patient is shown. Signals of free EDTA, CaEDTA2-, and MgEDTA2- are marked. B) A typical spectrum of the Latin-square dataset is displayed. Metabolites added to the urine matrix are marked.

Figure 2. Flow chart describing the envisaged workflow. The workflow starts with procedures for the detection of unbalanced regulation. In the absence of unbalanced regulation, one can proceed with standard normalization procedures. In the presence of unbalanced regulation, features showing high variability in concentration will be iteratively removed until mostly non-regulated features remain. These will be used as reference features for subsequent data normalization. Figure 3. Box plot representation of the CaEDTA2- and MgEDTA2- 1H NMR signal intensities of the corresponding NCH2CH2N signals at 2.565 and 2.705 ppm, respectively. The size of the boxes is defined by the upper and lower quartile of the data while the thick line within each box indicates the median. The whiskers extend to the most extreme data point lying within 1.5 times the inter-quartile distance from the box. Outliers are indicated by small circles. Starting from the upper left corner results for data scaled relative to the TSP reference signal followed by log2 transformation, Variance

Stabilization

Normalization

employing

all

features

and

Variance

Stabilization Normalization based on the 20% least variable features are shown.

Figure 4. Deviations between expected and observed fold changes. Deviations between expected and observed fold changes are shown as boxplots for four decreasing differences in glucose levels and for each of the tested normalization methods. The y-axis indicates the difference in fold changes while the x-axis denotes the four group comparisons G8 versus G1, G7 versus G2, G6 versus G3, and G5 versus G4, with corresponding differences in glucose concentration of 198.4 mmol/L, 96.9 mmol/L, 43.8 mmol/L, and 12.5 mmol/L, respectively. The size of the boxes is defined by the upper and lower quartile of the data while the thick line within each box indicates the median. The whiskers extend to the most extreme data point lying within 1.5 times the inter-quartile distance from the box. Outliers are indicated by small circles. From the upper left corner to the lower right corner, results for not ACS Paragon Plus Environment

27

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

normalized data, Linear Baseline Normalization based on mean values, Linear Baseline Normalization based on median values, Probabilistic Quotient Normalization (PQN), Quantile Normalization, and Variance Stabilization Normalization (VSN) are shown.

Figure 5. Determination of reference features. A) Latin-square design, and B) AKI data. On the x-axis the percentage of features used as reference features for normalization is shown. The y-axis indicates the mean standard deviation of the individual scaling factors of the selected reference features (mswsd). For each setting of selected reference features a resampling approach was used for the computation of mswsds. These are depicted as box plots for assessing their stability. A red horizontal line indicates where the mswsd values approach a nearly constant value.

Figure 6. Deviations between expected and observed fold changes after exclusion of high variance signals. Eighty per cent of the features showing the highest variance were excluded from calculating the parameters of the data normalization. Deviations between expected and observed fold changes are shown as boxplots using the same parameters as given for Figure 2. From the upper left corner to the lower right corner, results for Linear Baseline Normalization based on mean values, Linear Baseline Normalization based on median values, Probabilistic Quotient Normalization (PQN), Cyclic Loess Normalization, Cubic Spline Normalization, and Contrast Normalization are shown.

ACS Paragon Plus Environment

28

Page 28 of 35

Page 29 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 1

ACS Paragon Plus Environment

29

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 2

ACS Paragon Plus Environment

30

Page 30 of 35

Page 31 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 3

ACS Paragon Plus Environment

31

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 4

ACS Paragon Plus Environment

32

Page 32 of 35

Page 33 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 5

ACS Paragon Plus Environment

33

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 6

ACS Paragon Plus Environment

34

Page 34 of 35

Page 35 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Graphical Abstract/TOC Data normalization is an essential step in NMR-based metabolomics, especially, in the presence of unbalanced metabolic regulation. Results show that the ShapiroWilk-Test may be used for the detection of unbalanced regulation. It is also demonstrated

that

Linear

Baseline

Normalization,

Probabilistic

Quotient

Normalization and Variance Stabilization Normalization in combination with variance based feature selection are relatively stable even in the presence of large intergroup inhomogeneities.

ACS Paragon Plus Environment

35