
Correlation patterns in experimental data are affected by normalization procedures: consequences for data analysis and network inference

Edoardo Saccenti*

Laboratory of Systems and Synthetic Biology, Wageningen University

Corresponding Author * Edoardo Saccenti, Laboratory of Systems and Synthetic Biology, Wageningen University, Stippeneng 4, 6708 HB, Wageningen, The Netherlands. Email: [email protected]; Tel: +31 (0)317 482018; Fax: +31 (0) 3174 83829


ABSTRACT

Normalization is a fundamental step in data processing to account for the sample-to-sample variation observed in biological samples. However, the data structure is affected by normalization. In this paper we show how, and to what extent, the correlation structure is affected by the application of 11 different normalization procedures, and we discuss the consequences for data analysis and interpretation, including principal component analysis, partial least squares discrimination, and the inference of metabolite–metabolite association networks.

KEYWORDS

Sample-to-sample variation; covariance analysis; spurious correlation; NMR; urine dilution; PCA; PLS-DA; COVSCA

INTRODUCTION

The concentrations of metabolites in biofluids can be considered to be related to the structure of the biological network underlying the different mechanisms governing an organism. For this reason, the concentrations of metabolites participating in these networks are expected to exhibit correlation patterns.1 Such correlation patterns are exploited by most multivariate data analysis tools, such as principal component analysis (PCA) and partial least squares regression (PLS), which are commonly used in metabolomics and other omics disciplines; they are often also the entry point for tools for the inference of biological networks, such as coexpression2 or probabilistic networks,3 which are gaining popularity within metabolomics studies.

Correlation patterns are inferred from the metabolite concentrations measured on biological replicate samples. However, sample-to-sample variability of both biological and technical (experimental) origin is often observed. This is especially true in the case of urine, which is by far one of the most frequently used biofluids in metabolomics. The urine volume can vary because of differences in water consumption, and variations of up to a factor of 15 have been observed in urine excretion.4 Other sources of variation can be related to sample preparation (variation in titration or buffer addition) or to the measurement: in the case of nuclear magnetic resonance (NMR) spectroscopy, perturbations in the baseline or changes in acquisition parameters such as the number of scans.4 As a consequence, the concentrations of endogenous metabolites also vary, and normalization is necessary to remove this unwanted variation and to make samples directly comparable. This problem is not restricted to metabolomics; for instance, normalization is a fundamental step in the processing of microarray or RNAseq data to correct for between-sample distributional differences in read counts and library size, and several methods proposed in the field of transcriptomics have been applied to metabolomic data5 (see Table 1).


However, while correcting for nuisance effects, normalization also affects the data structure, influencing the variance of the variables and ultimately perturbing the correlation patterns estimated from the data. Because of this, conclusions drawn from the analysis of correlation patterns can be biased depending on which normalization method is used. In this paper, we examine the effect of normalization on the structure of metabolomic data; this interest is certainly not new and has been addressed in the context of exploratory and predictive data analysis5, 6 and biomarker selection.5, 7 However, in the present study we analyze in detail how correlation patterns change upon normalization of the data (considering previously investigated methods5, 8, 9), and we show how this can affect PCA, partial least squares discriminant analysis (PLS-DA), and the inference of metabolite–metabolite association networks.

Normalization affects correlations

To set the scene, we begin with three motivating examples to illustrate how normalization can perturb the correlation structure of the data. First, a small numerical example mimicking normalization to the total spectral area (see also "Materials and Methods") is presented. Second, a wider set of simulations shows the correlation patterns that arise when normalization is applied to uncorrelated variables. Third, a real-life situation is considered.

Numerical example 1: Normalization to total spectral area

Let us consider three column vectors x1, x2, and x3, containing the measured values of three random variables (i.e. metabolites) on n = 4 samples,

$$\mathbf{x}_1 = \begin{pmatrix} 10.7 \\ 9.9 \\ 10.7 \\ 9.8 \end{pmatrix}, \quad
\mathbf{x}_2 = \begin{pmatrix} 21.3 \\ 21.3 \\ 22.8 \\ 21.0 \end{pmatrix}, \quad
\mathbf{x}_3 = \begin{pmatrix} 48.0 \\ 33.0 \\ 51.4 \\ 55.0 \end{pmatrix} \qquad (1)$$

which are stored in a data matrix X of size n × p. The Pearson correlations between the three variables are as follows:

$$\mathrm{corr}(\mathbf{x}_1, \mathbf{x}_2) = 0.65, \quad \mathrm{corr}(\mathbf{x}_1, \mathbf{x}_3) = 0.26, \quad \mathrm{corr}(\mathbf{x}_2, \mathbf{x}_3) = 0.19 \qquad (2)$$

Now we normalize the data by dividing every row of X (i.e. every sample) by the sum of its elements and call the resulting data matrix Y. This is equivalent to normalizing each sample to the total sum of the measured metabolite concentrations, often referred to as normalization to the total spectral area. The columns of Y are now

$$\mathbf{y}_1 = \begin{pmatrix} 0.13 \\ 0.15 \\ 0.13 \\ 0.13 \end{pmatrix}, \quad
\mathbf{y}_2 = \begin{pmatrix} 0.27 \\ 0.33 \\ 0.27 \\ 0.25 \end{pmatrix}, \quad
\mathbf{y}_3 = \begin{pmatrix} 0.60 \\ 0.51 \\ 0.61 \\ 0.64 \end{pmatrix} \qquad (3)$$

The correlation between the three variables y1, y2, and y3 is now:

$$\mathrm{corr}(\mathbf{y}_1, \mathbf{y}_2) = 0.96, \quad \mathrm{corr}(\mathbf{y}_1, \mathbf{y}_3) = -0.98, \quad \mathrm{corr}(\mathbf{y}_2, \mathbf{y}_3) = -1.0 \qquad (4)$$

The three variables are now strongly correlated (in absolute value). This simple example shows how normalization can alter the structure of the relationships among measured metabolites.

Numerical example 2: Other normalization approaches

Consider now three random variables, x, y, and z, which are independent and identically distributed (i.i.d.) and realized on samples of size n. These variables are uncorrelated at the population level. Figure 1 shows the correlation between x and y and between the newly defined functions of these random variables: x/z and y/z; xz and yz; x/(x + y) and y/(x + y). It appears that the correlation patterns of "normalized" data can differ from those observed in the raw data. In this case it is the z variable (and the denominator x + y) that introduces a correlation between the pairs xz and yz, x/z and y/z, and x/(x + y) and y/(x + y). Such correlations are called spurious correlations and were first described by Pearson, who addressed the case of x/z and y/z.10 The short mathematical overview presented in the Appendix shows how, even for simple manipulations of the data, it is very complicated, or even impossible, to analytically express the correlation between two variables in the normalized data as a function of the values of the non-normalized data. In most cases, normalization affects the correlation in an unpredictable way.

Real life example

We now consider a last motivating example using real NMR experimental data (see "Materials and Methods"): we compare the correlation profiles of some metabolite pairs in the raw data and after normalization to the creatinine concentration, which is a very common approach (although not always appropriate; see "Materials and Methods" for a discussion) to minimize the variation due to varying urine excretion (referred to as urine output in the literature). The results are shown in Figure 2. It appears that normalizing to creatinine can affect the correlation to the point that no relationship exists between the correlation patterns of two (or more) variables before and after data normalization.
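The two numerical examples above can be reproduced in a few lines of R; the sketch below is purely illustrative (it is not the code used for the analyses in this paper), and the sample size and distributions in the second part are arbitrary choices.

```r
## Numerical example 1: normalization to the total spectral area (TSA)
X <- cbind(x1 = c(10.7, 9.9, 10.7, 9.8),
           x2 = c(21.3, 21.3, 22.8, 21.0),
           x3 = c(48.0, 33.0, 51.4, 55.0))
round(cor(X), 2)                 # raw-data correlations, cf. Equation (2)
Y <- X / rowSums(X)              # divide every sample (row) by its total
round(cor(Y), 2)                 # correlations after normalization, cf. Equation (4)

## Numerical example 2: spurious correlation between functions of i.i.d. variables
set.seed(1)
n <- 1000
x <- rnorm(n, 10, 1); y <- rnorm(n, 10, 1); z <- rnorm(n, 10, 1)
cor(x, y)                        # ~0: x and y are uncorrelated
cor(x / z, y / z)                # spurious correlation induced by the common divisor z
cor(x * z, y * z)                # spurious correlation induced by the common factor z
cor(x / (x + y), y / (x + y))    # exactly -1: the two ratios sum to 1
```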


MATERIALS AND METHODS

Overview of normalization methods

The approaches for data normalization can be categorized into two main branches: methods that perform object-wise (sample-wise) transformations with the aim of reducing between-sample variation, and methods that aim to make the variances of the different variables (approximately) equal by applying some transformation.11 The latter class includes scaling methods (such as unit-variance or Pareto scaling), which will not be covered here; they have been discussed elsewhere.5, 9, 12 In this paper, we focus only on the first class of methods.

Most common normalization methods estimate a weighting factor to be assigned to each sample (i.e. they attempt to estimate a sample-specific scale factor α) to make the samples comparable.9, 13 For many linear methods, it is implicitly assumed that the average sum of all measured metabolites (or the total spectral area, TSA) is constant across samples or groups of samples; stated otherwise, it is assumed that the total amount of dissolved metabolites is invariant, an assumption rarely met in real life.11 Similarly, a method like the popular normalization to creatinine is based on the assumption that creatinine clearance is constant, which may not be true in the presence of pathophysiological alterations or metabolic dysregulation. When these assumptions are not met, biased or even wrong results can be obtained. In this paper, we consider the case of normal physiological conditions, under which the application of the methods investigated here is justified. Hochrein et al.9 investigated the performance of several methods with respect to the correct detection of fold changes in the presence of unbalanced metabolite regulation.

We investigated the performance of 11 normalization methods, which are listed in Table 1. Many methods have been proposed and applied for data normalization; the rationale behind our choice is that these methods have already been extensively reviewed and compared in the papers by Kohl et al.5 and Hochrein et al.9 This paper thus aims to complement those papers rather than to offer a comprehensive review of all existing methods and their characteristics and performance; hence, methods based on linear (mixed) models, which constitute a class in themselves, have not been considered.14, 15 Hereafter, we offer a brief description of the methods for the convenience of the reader; for more details, we refer to the original publications and to the abovementioned review papers. In the description of the methods we refer to metabolite concentrations; however, the methods work in the same way if bins or other spectral features are used.

Among the methods considered here, only TSA, urine output (UO), internal standard (IS), and probabilistic quotient normalization (PQN) are native to metabolomic data; all the others were developed to normalize microarray data, which have inherently different properties in terms of variance and covariance patterns and error structure. It should also be noted that microarray data are usually analyzed in a univariate fashion, and thus normalization is often intended to recover an accurate measure of differential gene expression between two or more conditions rather than the correlation of any pair of genes, even though the latter may be crucial not only for reverse engineering but also for other methods, such as hierarchical clustering.16

Total Spectral Area Normalization


TSA: The concentration of each metabolite is divided by the sum of the concentrations of all measured metabolites in the given sample. An alternative version, excluding lactate, glucose, and urea concentrations (depending on the biofluid), has been proposed.17 TSA assumes that the total amount of metabolites excreted in different samples is approximately the same. Note that this normalization procedure is the one applied in Numerical example 1 above. Other approaches have been proposed.

Normalization to Internal Standard

IS: The concentration of each metabolite is divided by the concentration of an internal reference metabolite; in the case of urine, a common approach is to divide by the creatinine concentration.18 The assumption is that creatinine excretion (also called creatinine clearance) into urine is constant and thus that its concentration is directly related to the urine concentration. However, this may not be true in the presence of external stimuli or pathophysiological conditions.

Normalization to the Urine Output

UO: The urine output is the volume of urine excreted per hour per kilogram of body mass. Normalization is performed by multiplying the raw metabolite concentrations by the urine output.17

Probabilistic Quotient Normalization

PQN: This method starts with a TSA normalization of the concentration profiles, followed by the calculation of a reference profile (median or baseline). For each metabolite, the quotient between its concentration in a given sample and its concentration in the reference sample is calculated, and the median of all quotients is estimated. All variables of the test profile are then divided by the median quotient.19 PQN can also be performed without the initial TSA normalization step; however, we consider here the original approach, which was also investigated in the review by Kohl et al.5

Cyclic Loess Normalization

Loess: Given any two samples in the data set, the difference m and the average a of the logarithms of the concentrations are calculated. A normalization curve is fitted to the line obtained by plotting m versus a, the so-called MA plot proposed by Dudoit et al.20 The rationale is that, if the two samples are properly normalized, the MA plot will show a cloud of points scattered along the m = 0 axis.21 The normalization curve is fitted using the Loess (locally weighted scatterplot smoothing) approach, in which simple non-linear models are fitted to subsets of the data with the aim of describing local variation.22 The normalization curve is subtracted from the original values, the procedure is applied to all possible pairs of samples, and the overall procedure is repeated until convergence.
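As a concrete illustration, the TSA and PQN procedures described above can be sketched in R as follows (samples in rows, metabolites in columns). This is a simplified illustration, not the script of Kohl et al.5 that was used for the analyses in this paper.

```r
## Toy implementations of TSA and PQN normalization; X is an n x p matrix
## of metabolite concentrations (samples in rows, metabolites in columns).
tsa_normalize <- function(X) {
  X / rowSums(X)                          # each sample divided by its total
}

pqn_normalize <- function(X) {
  Xt  <- tsa_normalize(X)                 # initial TSA step (original PQN recipe)
  ref <- apply(Xt, 2, median)             # reference (median) profile
  quotients <- sweep(Xt, 2, ref, "/")     # metabolite-wise quotients per sample
  dilution  <- apply(quotients, 1, median)  # most probable dilution per sample
  Xt / dilution                           # divide each sample by its median quotient
}

## Example on simulated positive data:
set.seed(42)
X <- matrix(rlnorm(20 * 5), nrow = 20, ncol = 5)
round(cor(pqn_normalize(X), method = "spearman"), 2)
```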


Contrast Normalization

Contrast: This method is also based on the MA-plot idea. The data are first logarithmically transformed, and an orthonormal transformation is then applied so that the rows become a set of orthogonal contrasts; the logarithmic transformation is taken to stabilize the error variance. Normalization of the transformed data is carried out, similarly to Cyclic Loess normalization, by fitting a set of normalization curves. A robust distance measure based on the Euclidean norm is used to ensure that the normalization is independent of the orthonormal transformation chosen. Finally, the data are mapped back to the original input space.23

Quantile Normalization

Quantile: Quantile normalization aims to make all samples have the same distribution of metabolite concentrations.21 The rationale is that the quantile–quantile plot of two identical samples would be a straight diagonal line; this idea can be extended to n samples. The procedure consists in sorting the metabolite concentrations into ascending order for each sample separately and computing, at each rank, the mean across the n reordered samples. The metabolite with the largest concentration in a sample is then assigned the mean of the largest concentrations across the samples, the metabolite with the second largest concentration is assigned the mean of the second largest concentrations, and so on. In this way, the same values can be assigned to different metabolites, since different metabolites can have the largest concentration in different samples. It should be noted that, after Quantile normalization, all samples contain the same set of concentration values, albeit distributed differently across the metabolites.5
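The Quantile procedure just described can be sketched in a few lines of R (samples in rows); again, this is an illustration rather than the implementation used in the paper.

```r
## Toy Quantile normalization: every sample ends up with the same set of values,
## assigned according to the within-sample ranks of the metabolites.
quantile_normalize <- function(X) {
  ranks  <- apply(X, 1, rank, ties.method = "first")  # p x n within-sample ranks
  sorted <- apply(X, 1, sort)                          # p x n sorted concentrations
  means  <- rowMeans(sorted)                           # mean at each rank across samples
  Xq <- t(apply(ranks, 2, function(r) means[r]))       # map each rank back to its mean
  dimnames(Xq) <- dimnames(X)
  Xq
}
```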

Linear Baseline Normalization

Linear: This method uses a scaling factor to linearly map each sample to a baseline sample, which is usually derived by taking the median or mean concentration profile of a subset of samples. The scaling factor is computed for each sample as the ratio of the mean concentration of the baseline to the mean intensity of the spectrum; the intensities of all spectra are then multiplied by the corresponding estimated scaling factors.21 This approach assumes the existence of a linear relationship between the samples and the baseline.

Non-Linear Baseline Normalization

Li–Wong: This method extends the idea of baseline normalization by assuming the existence of a non-linear relationship between the samples and the baseline. The concentration profiles are normalized by mapping them to the baseline concentrations using a normalization curve, defined as the smoothing spline fitted on the metabolites that are similar in rank in both samples.24 The baseline sample is defined as the median spectrum, and the normalization acts in such a way as to align the plot of a given sample against the baseline sample along the diagonal. The selection of metabolites that have similar ranks in the two samples ensures that the normalization curve is calculated on only a subset of rank-invariant metabolites, thus performing a feature-selection step.
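A minimal R sketch of the Linear baseline procedure, using the median profile as baseline; this is an illustration under the assumptions just stated, not the paper's implementation.

```r
## Toy Linear baseline normalization (samples in rows): one scaling factor per
## sample maps it linearly to the median (baseline) profile.
linear_baseline_normalize <- function(X) {
  baseline <- apply(X, 2, median)            # median concentration profile
  scaling  <- mean(baseline) / rowMeans(X)   # one scaling factor per sample
  X * scaling                                # multiply each sample by its factor
}
```

Like TSA, this amounts to multiplying each sample by a single number, which is consistent with the later observation that TSA and Linear baseline normalization yield the same correlation matrix.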


Cubic Spline Normalization

CSN: A further extension of the non-linear approaches, this method aims to make the distribution of the metabolite concentrations similar across all samples, similarly to Quantile normalization. A baseline sample is built by computing the geometric or arithmetic mean of the concentration of each metabolite over all samples. A set of evenly distributed quantiles is taken from both the baseline and the target sample and used to fit a smooth cubic spline. The process is iterated several times, shifting the set of quantiles by a small offset each time. Finally, a spline function generator uses the generated set of interpolated splines to fit the parameters of a natural cubic spline.25 Consistently with reference 5, the baseline spectrum is constructed using the arithmetic mean.

Variance Stabilization Normalization

VSN: This approach combines normalization with stabilization of the metabolite variances; different implementations are available (see the review paper by Kohl et al.5 and references therein). VSN aims to keep the variance constant over the entire data range. Here the version implemented in the R package "vsn" is used:26 first, the sample-to-sample variation is reduced by linearly mapping the concentration profile of each sample to a reference sample (the first in the data set); then the variance is adjusted by applying an inverse hyperbolic sine transformation.26 This method is a hybrid between an object-wise normalization and a scaling procedure, since it both reduces the sample-to-sample variation and adjusts the variances of the different metabolites. It is included for consistency with references 5 and 9.
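The two steps of VSN can be caricatured as follows. The actual analyses in this paper use the R package "vsn"; the toy version below (an ordinary least-squares mapping of each sample to the first sample, followed by an asinh transform) only conveys the two-step idea and is not the package's algorithm.

```r
## Schematic VSN-like transformation: linear mapping to a reference sample,
## then an inverse hyperbolic sine (variance-stabilizing) transform.
vsn_like <- function(X) {
  ref <- X[1, ]                                # reference profile: first sample
  mapped <- t(apply(X, 1, function(s) {
    fit <- lm(ref ~ s)                         # linear map from sample s to the reference
    coef(fit)[1] + coef(fit)[2] * s
  }))
  asinh(mapped)                                # variance stabilization
}
```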

Network concepts

Some terms and definitions used in the network analysis are presented here. Networks, which are graphical representations of objects (called nodes) and of their relationships (described by links or edges), can be conveniently described using a matrix, termed an adjacency or connectivity matrix, whose rows and columns represent the nodes (in this case, metabolite concentrations). The non-zero elements of such a matrix are real numbers describing the strength of the relationship between any two nodes and are usually in the [−1, 1] range for correlation, in the [0, +∞) range for mutual information, or in the [0, 1] range for probability. By imposing a threshold on these values, binary (0,1) adjacency matrices are obtained. The node degree (or connectivity) is defined as the number of connections between a given node and all other nodes: it can be obtained by summing over the columns (or the rows) of the (weighted) adjacency matrix.

Network inference

The PCLRC (Probabilistic Context Likelihood of Relatedness on Correlations) algorithm3 is based on a modification of the CLR (Context Likelihood of Relatedness) algorithm [using correlation instead of mutual information (MI) to measure the similarity between profiles] and rests on iterative sampling of the data set. At each of the 10^4 iterations, 75% of the samples from the full data set are used to calculate the correlation matrix; from this matrix, only the highest 30% of the correlations (in absolute value) are kept. The method produces a weighted adjacency matrix W containing an estimate of the likelihood of a correlation between any two metabolites, expressed as a probability measure: a threshold value of 0.95 is imposed on these probabilistic networks to obtain from W a binary adjacency matrix A = {aij}:

$$a_{ij} = \begin{cases} 1 & \text{if } W_{ij} \geq 0.95 \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
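The resampling logic underlying PCLRC can be sketched as follows. This is only a schematic illustration: the published implementation (available at http://semantics.systemsbiology.nl/) differs in its details (e.g., the CLR-based scoring) and uses more iterations; here Spearman correlation is used as the similarity measure and the selection frequency plays the role of W.

```r
## Schematic PCLRC-like procedure: repeatedly subsample 75% of the samples,
## keep the strongest 30% of |correlations|, and record how often each pair
## of metabolites is selected; then threshold the selection probability.
pclrc_like <- function(X, n_iter = 1000, frac = 0.75, keep = 0.30) {
  p <- ncol(X)
  counts <- matrix(0, p, p)
  for (k in seq_len(n_iter)) {
    idx <- sample(nrow(X), size = floor(frac * nrow(X)))
    R   <- abs(cor(X[idx, ], method = "spearman"))
    thr <- quantile(R[upper.tri(R)], probs = 1 - keep)   # cutoff for the top 30%
    counts <- counts + (R >= thr)
  }
  W <- counts / n_iter        # selection probability for each metabolite pair
  A <- (W >= 0.95) * 1        # binary adjacency matrix, cf. Equation (5)
  diag(A) <- 0
  A
}
```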

Analysis of correlation matrices

COVSCA (COVariance Simultaneous Component Analysis) is a model that was recently introduced to analyze commonalities and differences across a set of Sk (k = 1, 2, ..., K) covariance matrices simultaneously.27 As an adjacency matrix can be considered a particular case of a covariance matrix, this method can also be used to model commonalities and differences between adjacency matrices. In COVSCA, the matrices are approximated as a combination of a limited number (L

10 while values μi for i ≤ 10 were randomly chosen in such a way that a PLS-DA model aiming to discriminate data sets X1 and X2 would yield a classification model with an area under the receiver operating characteristic curve (AUROC) equal to 0.77, and thus a model of moderately strong quality. For the X2 data, the urine output was taken to be equal to the measured one and was corrupted with additive Gaussian noise N(0, σ²) with σ² = 0.01.

Software

Normalization was carried out in the R environment33 using the script available in the supplementary material of the paper by Kohl et al.5 Subsequent analysis, including PCA, was carried out in the Matlab environment (Matlab R2016a, MathWorks, Natick, MA, US). Matlab code for performing Parallel analysis and COVSCA and the R code implementing the PCLRC algorithm are available at http://semantics.systemsbiology.nl/. PLS-DA was performed using the MLPLSDA toolbox available at http://www.bdagroup.nl.

RESULTS AND DISCUSSION

Which correlation measure?

The correlation measures most frequently used in metabolomics are Pearson's and Spearman's correlation. Since linearity cannot always be assumed, correlation between metabolites should be assessed using Spearman's rather than Pearson's correlation, although the latter may be preferred in the case of limited sample size.34 In contrast to Pearson's correlation, Spearman's coefficient is not a measure of the linear relationship between two variables but assesses how well an arbitrary monotonic function can describe the relationship between two variables, without making any assumptions about their frequency distributions.35 Since in the case of metabolic/metabolomic data the assumption of linearity may not be fully justified, the results presented and discussed in the remainder of the paper are derived using Spearman's correlations unless stated otherwise.

Normalization alters the data correlation structure

The (dis)similarities among the correlation matrices obtained from the data matrix X after the different normalization methods have been applied can be investigated by fitting a COVSCA model. Figure 3 shows a scatter plot of the first two components of the model, which indicates that all normalization methods affect, to a certain extent, the correlation structure: PQN, VSN, Spline, and Quantile normalization result in similar correlation patterns; UO generates a correlation matrix that is more similar to the one obtained from the original data; Li–Wong and Contrast normalization affect the correlation structure the most. It is interesting to note that TSA and Linear baseline normalization produce data that have the same correlation matrix (the corresponding points overlap): indeed, data vectors are normalized to 1 in TSA and to 22.83 in the Linear case, the latter number coming from the fact that a scaling factor (one for every sample) is used to linearly map each sample to the baseline, here calculated as the median urinary profile over the n samples.

Normalization affects the magnitude and significance of correlations

Following the nomenclature introduced in reference 34, we divide the correlations into three levels: low (|ρ| ≤ 0.6), medium (0.6 < |ρ| < 0.8), and high (|ρ| ≥ 0.8). As shown in Figure 4A, low correlations are the most abundant in the raw data. This is a very common situation in metabolomics:34 low correlations can be observed even when metabolites are neighbors in a metabolic pathway, because the variance in the enzymes that control them affects them in equal amounts and in different directions. Indeed, this is what happens for the majority of metabolite pairs, and it is a consequence of the systemic nature of metabolic control.34


Normalization affects the magnitude of correlations, as shown in Figure 4A: VSN, Spline, Quantile, Loess, PQN, Li–Wong, and IS normalization tend to flatten the magnitude of the correlations, while Linear, Contrast, and TSA normalization seem to preserve them. Interestingly, normalization to UO seems to magnify the correlation values, increasing the number of medium and high correlations. The distributions of the correlation magnitudes differ across the correlation matrices obtained after the different normalizations: of the 66 possible pairwise comparisons (using a Kruskal–Wallis test), 52 are statistically significant at the 0.05 level after correction of the P-values for multiple testing using Bonferroni's correction, indicating that the correlation magnitudes are distributed differently. Table 2 lists the 14 non-significant comparisons.

As a consequence of the changes induced in the correlation magnitude, normalization also affects the statistical significance of the correlations, as shown in Figure 4B: only TSA normalization produces significance patterns similar to those present in the original data. Correlations estimated from UO normalized data are almost all significant (P-value < 0.05), while the other methods reduce the number of significant correlations. Given p variables, there are p(p − 1)/2 possible pairwise correlations: selecting the statistically significant correlations turns into a multiple-testing procedure for which correction is necessary to protect against the risk of false positives. Although relevant, this issue will not be addressed here.

Normalization affects the sign of correlations

When defining associations between metabolites, the sign of the correlation is often disregarded. However, knowledge of the sign is fundamental for the biological interpretation of the observed associations: a positive correlation occurs when an increase of the concentration of one metabolite is associated with a rise in the concentration of another metabolite, while a negative correlation occurs when an increase of the concentration of one metabolite is associated with a decrease of the concentration of another metabolite. Often, when several adjacent compounds in a metabolic pathway are positively correlated, the mechanism underlying the metabolite–metabolite association can be intuitively obvious.36 However, correlations between adjacent compounds in a pathway are not, in fact, always observed; see, for instance, the example of the glutamate–glutamine pair reported in reference 34. The case of negative correlations is more complicated: these are frequently seen between apparently unrelated compounds in widely separated pathways, suggesting the existence of unknown metabolic control mechanisms or other interactions.36 In this light, it is evident that the sign is fundamental for interpretation and for generating new hypotheses regarding metabolic mechanisms.

Normalization affects the sign of correlations among metabolites. Figure 4C shows the percentages of positive and negative correlations observed upon normalization of the data: VSN, Spline, Quantile, Loess, and PQN normalization induce an almost 50% increase in negative correlations. This implies that for many metabolite–metabolite pairs the observed correlation can be either positive or negative, depending on which method is applied to the raw data.
Figure 4D offers an overview of the percentage of correlations whose sign changed after normalization (since the true correlation structure is not known, the comparison is made with respect to the raw data; see also the section "Which normalization method should be used?"). As shown, UO normalization is the method that seems to alter the sign of correlations the least. This can be understood by considering that UO normalization involves a multiplicative factor (i.e., the volume of urine excreted per hour per kilogram of subject mass), which is a physiological quantity that is not estimated from the concentrations of the metabolites. Other methods, such as PQN, transform the concentration of each metabolite into a new quantity that is a complicated function of the concentrations of all other metabolites, and this can affect the sign in an unpredictable way. This is also true for TSA normalization, which has a rather simple mathematical formulation; nevertheless, expressing the correlations of the normalized data in terms of the normalization parameters is a rather cumbersome problem (see the section "Mathematical relationship between correlations in raw and normalized data").

A natural question is whether there is an association between the change of the correlation sign and the magnitude of the correlation itself. Figure 5 shows the percentage of correlations whose sign changed upon data normalization for each correlation level (low, medium, and high). In general, lower correlations are the most affected, which is reasonable since most normalization methods tend to level out correlation values (see Figure 4A). However, there is the notable exception of the Li–Wong normalization: almost 27% of high correlations (|ρ| ≥ 0.8) changed sign upon data normalization. For instance, consider cytosine, whose correlations in the raw data with pyruvate, tiglylglycine, alanine, hypoxanthine, 2-methylglutarate, cis-aconitate, 3-methylxanthine, tyrosine, and hippurate are in the range of 0.80 to 0.91; these values were flattened to values in the range of −0.28 to −0.001, transforming high positive correlations into low negative correlations. Since such low values would likely be rounded off to zero, these correlations would be discarded, leading to a situation in which these metabolites are highly correlated in the raw data but not correlated at all in the Li–Wong normalized data. We note that the negative correlations observed in the normalized data are mostly low correlations. However, there are some noteworthy exceptions, which are summarized in Table 3.

To what extent does normalization introduce spurious correlation?

Referring back to the numerical simulation examples, we have seen that normalization induces spurious correlation (see Figure 2). To investigate the extent to which the different normalization methods induce spurious correlation, we permuted the experimental data in such a way that the mean and variance of every variable (metabolite) were preserved but the correlation structure present in the data was destroyed; after the permutation, all variables are uncorrelated. The different normalization methods were then applied to the permuted data matrix. Figure 6 shows histograms of the median correlation observed in the permuted data after normalization: all methods induce spurious correlation, although of different magnitudes and signs. PQN, VSN, Spline, and Quantile normalization induce spurious correlations with an average magnitude of 0.01; for the other methods, the magnitude is of the order of 0.5. Only for non-normalized data is the distribution of correlations centered on 0, as expected.
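The permutation scheme just described is straightforward to express in R; the sketch below uses TSA as the example normalization and is illustrative only.

```r
## Each metabolite (column) is permuted independently: means and variances are
## preserved but the correlation structure is destroyed. The normalization is then
## applied to the permuted data and the resulting correlations are inspected.
permute_columns <- function(X) apply(X, 2, sample)

spurious_cor <- function(X, normalize = function(M) M / rowSums(M)) {
  Xperm <- permute_columns(X)              # uncorrelated by construction
  R <- cor(normalize(Xperm), method = "spearman")
  R[upper.tri(R)]                          # off-diagonal correlations after normalization
}
## median(spurious_cor(X)) should be ~0 without normalization and may deviate
## from 0 after normalization, as in Figure 6.
```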


The significance of spurious correlations was assessed using Parallel analysis (see "Materials and Methods"). We tested the null hypothesis that no correlations exist among the variables against the alternative that a correlation exists between two (or more) variables. Choosing a significance level of α = 0.05 and repeating the overall procedure K = 1000 times, we expect the null hypothesis to be rejected 5% of the time if H0 is true (i.e. there is no correlation in the data): a rejection rate larger than 5% indicates the presence of spurious correlation induced by the normalization method. The results are summarized in Table 4. For non-normalized data, the rejection rate (i.e. the actual α level) is 0.05, as expected. For both PQN and VSN, the rejection rate is very close to the expected value, indicating that these methods do not introduce spurious correlations in the data; the same can be concluded, albeit to a lesser extent, for Quantile and Spline normalization. All other methods induce spurious correlations in the data.

The interpretation of the UO normalization results deserves some further comment. This normalization procedure involves a multiplicative factor that is determined experimentally and not numerically estimated from the data. Clearly, while it is possible to permute the metabolite concentrations across samples, it is not possible to permute the normalization factor. In the simulations, for a given sample in the data matrix, the normalization factor used (i.e. the urine output) was always the same as the one used for the corresponding non-permuted sample. For this reason, we also implemented another simulation scheme for UO normalization, in which each sample in the permuted data was multiplied by a random factor, lognormally distributed with the same mean and variance as observed in the experimental UO factors: the results were unchanged.
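The general logic of such a test can be sketched as follows. The test statistic (the largest eigenvalue of the correlation matrix) and the permutation-based null distribution used here are illustrative choices and not necessarily the exact Parallel analysis procedure used in the paper.

```r
## Schematic parallel-analysis-type test for the presence of any correlation:
## compare the largest eigenvalue of the observed correlation matrix with its
## null distribution obtained by permuting each column independently.
pa_reject <- function(X, alpha = 0.05, n_null = 200) {
  ev_obs <- max(eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values)
  ev_null <- replicate(n_null, {
    Xp <- apply(X, 2, sample)              # breaks all correlations
    max(eigen(cor(Xp), symmetric = TRUE, only.values = TRUE)$values)
  })
  ev_obs > quantile(ev_null, 1 - alpha)    # TRUE = reject H0 of no correlation
}
```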

Consequences for network inference

We have observed that normalization influences both the magnitude and the significance of correlations. This can have a serious impact when correlations are used as the basis for the inference of biological networks. To set the scene, we start by considering a simple, but often used, approach to define association networks: it consists in creating an adjacency matrix from the weighted correlation matrix by imposing a threshold τ on the correlation |ρij| between any pair of metabolites. Often a threshold is also imposed on the associated P-value, P:

$$a_{ij}^{M} = \begin{cases} 1 & \text{if } |\rho_{ij}| \geq \tau \ (\text{and } P \leq \alpha) \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

Here the index M runs over the normalization methods applied to the raw data. This kind of thresholding on the correlation value is called hard thresholding. The resulting adjacency matrix depends on the chosen threshold τ (and on the significance level α), and thus it is clear that, given the same τ (and α), different networks can be obtained from the same data. Figure 7 shows the results of a COVSCA modeling of the adjacency matrices obtained when imposing a threshold of τ = 0.6 and α = 0.05 on the correlation matrices obtained from data normalized with the different strategies. Two groups appear: networks obtained from Linear, TSA, Li–Wong, Contrast, and UO normalized data are clustered around the network obtained from the raw data (Cluster I); Loess, PQN, VSN, Quantile, and Spline normalized data output very similar networks (Cluster II), and the spread among these networks is smaller than that observed in Cluster I. Notably, data normalized using IS (normalization to creatinine) result in a network that seems to be different from all the others.

It is interesting to compare Figure 7 with Figure 3: it appears that imposing a hard threshold levels out some of the differences introduced by the normalization procedures; this is somewhat obvious, since the thresholding forces the correlations to 0 or 1 depending on whether they are below or above the threshold, so that the actual value of the correlation is lost. However, the thresholding shrinks the differences between the Linear, TSA, Li–Wong, and Contrast normalized data networks around the raw-data network and enhances the differences between this group of networks and the other group, creating two well separated groups.

Soft approaches have been proposed, especially in the field of gene expression (see, for instance, WGCNA2), consisting in applying the transformation (ρij)^β and choosing β in such a way that the resulting network shows a scale-free topology. Although this approach can be justified when considering large-scale regulatory networks consisting of thousands of genes, imposing such a restriction on metabolite–metabolite association networks, which usually comprise a limited number of metabolites (on the order of ~100), may be less warranted, since scale-freeness may not hold for such small networks. Nevertheless, since normalization affects the magnitude of correlations, it will also affect networks obtained by soft thresholding: because exponentiation to a power β (usually > 1) tends to preserve high correlations while pushing lower values towards 0, adjacency networks obtained from VSN, Spline, Loess, and PQN normalized data will be sparse, since the corresponding correlation matrices mostly contain low correlations.

Other approaches have been proposed to infer metabolite–metabolite association networks starting from the correlation matrix that dispense with the need to set a threshold on the correlation value by defining a probabilistic measure based on resampling. As an example, we consider the PCLRC algorithm which, starting from the correlation matrix, assigns to each correlation value a significance measure that has a probabilistic interpretation and that is independent of the magnitude of the correlation itself (see "Materials and Methods" for more details). When data normalized with the different methods are used to infer the biological association networks, we again observe different results in terms of the topology of the obtained networks. For the sake of brevity, we focus on metabolite connectivity, as described by the node degree, i.e. the number of links connecting each node (metabolite). The node degree of cytosine is shown in Figure 8A: it varies considerably when different normalization methods are used, with the extreme cases of Li–Wong normalization, where cytosine is not associated with any other metabolite, and TSA, where it is associated with 28 other metabolites. The agreement of the overall node degree profiles of all metabolites between the different methods is also variable, as shown in panel B of Figure 8.
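In code, hard thresholding (Equation (11)) and the node degree reduce to a few lines. The sketch below is illustrative (R and P denote a correlation matrix and its matrix of P-values), with τ and α as used for Figure 7; a WGCNA-style soft threshold would instead use a power of the (absolute) correlation as the adjacency.

```r
## Hard thresholding of a correlation matrix R with P-values P, and node degree.
hard_threshold <- function(R, P, tau = 0.6, alpha = 0.05) {
  A <- (abs(R) >= tau & P <= alpha) * 1    # binary adjacency matrix (Equation (11))
  diag(A) <- 0
  A
}
node_degree <- function(A) rowSums(A)      # number of links per metabolite (node)
```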
The overall structure of the networks inferred from TSA and Li–Wong normalized data is shown in Figure 9: the TSA network is densely connected, while the Li–Wong network is sparsely connected. In other words, the same metabolites have different connectivity patterns, and this may have dramatic consequences for biological interpretation, because there is special interest in highly connected nodes, the so-called hubs. Indeed, in almost all cases of biological networks there is evidence of the existence of (a few) hubs that are considered to play crucial biological roles; for instance, systematic gene deletion studies and the analysis of the yeast protein–protein interaction network revealed that highly connected nodes are more likely to be essential for survival. In this light, the fact that metabolite connectivity is so strongly dependent on the normalization method applied is a reason for major concern, since it will drive any subsequent biological discussion.

Effects on mutual information

Association networks can also be inferred by considering mutual information (see, for instance, the ARACNE,37 CLR,38 or MIDER39 algorithms, to mention just a few) rather than correlations. The mutual information MI between two discrete variables X and Y is defined as

$$MI(X;Y) = \sum_{i,j} p(x_i, y_j)\, \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \qquad (12)$$

where p(x_i, y_j) is the joint probability distribution function of X and Y, and p(x_i) (respectively p(y_j)) indicates the probability that X = x_i (respectively Y = y_j). In the present case, X and Y represent concentrations of measured metabolites. It has been shown40 that, if two variables follow a bivariate normal distribution and the continuous variables are suitably discretized, a closed-form relationship can be established between corr(x, y) and a normalized version of MI(x, y) (MI itself usually lies in [0, ∞)). This relationship maps the interval [0, 1] for the correlation onto the interval [0, 1] for the normalized mutual information, with one parameter approximately equal to 0.43m^0.3 and one depending on the sample size n. The relationship was derived for Pearson's correlation but has been found to hold also for other correlation measures; for more details, we refer the reader to reference 40.


Since such a relationship links mutual information and correlation, it follows that normalization can also affect mutual information and thus also the network inference methods based on this association measure.
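Equation (12) can be computed directly from binned data; the sketch below uses equal-width bins and is purely illustrative (it is not the estimator used by the ARACNE, CLR, or MIDER algorithms).

```r
## Mutual information of two metabolite concentration vectors x and y after
## discretization into equal-width bins (natural logarithm, i.e. nats).
mutual_information <- function(x, y, bins = 10) {
  xd <- cut(x, bins); yd <- cut(y, bins)
  pxy <- table(xd, yd) / length(x)          # joint probabilities p(x_i, y_j)
  px  <- rowSums(pxy); py <- colSums(pxy)   # marginal probabilities
  terms <- pxy * log(pxy / outer(px, py))   # p * log[p / (p_x p_y)]
  sum(terms[pxy > 0])                       # sum over non-empty cells only
}
```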

Normalization affects principal component analysis and discriminant analysis

Principal component analysis

The effects of normalization on PCA results were discussed in references 6 and 8. On the basis of the inspection of the score plots, differences were observed among the PCA models obtained from data to which different normalization procedures were applied. However, PCA models should be compared in terms of both loadings and scores, since the loadings define the model and it is through inspection of the loadings that the model should be interpreted. Figure 10 shows the loadings (i.e. the relative contribution of each variable to the PCA model) of the PCA models fitted to data normalized with the different procedures. PCA is very sensitive to differences in variance among variables; it should be noted that the loadings tend to be slightly different across normalizations, but these differences disappear when further preprocessing (i.e. autoscaling or Pareto scaling) is applied before PCA (not shown). Two dominant metabolites (glucose and urea) are always present (indicating that these metabolites have a larger variance than the others), with the exception of Contrast and IS normalized data (where the importance of glucose is downgraded) and Loess normalized data, in which greater importance is given to hippurate. The score plots of the corresponding models are shown in Figure 11. It can be noted that some structure seems to exist in the data, but the way in which the samples are clustered depends on the normalization applied.
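A comparison of loadings across normalizations can be scripted along the following lines. This is a toy example on simulated data with TSA only; the analysis in the paper covers all 11 methods and real data.

```r
## Compare PC1 loadings before and after TSA normalization on simulated
## positive data; dimensions and distributions are arbitrary.
set.seed(7)
X <- matrix(rlnorm(150 * 20, meanlog = 2, sdlog = 0.4), nrow = 150)
l1_raw <- prcomp(X, center = TRUE)$rotation[, 1]               # PC1 loadings, raw data
l1_tsa <- prcomp(X / rowSums(X), center = TRUE)$rotation[, 1]  # PC1 loadings, TSA-normalized
cor(l1_raw, l1_tsa, method = "spearman")   # compare loading patterns (PC sign is arbitrary)
```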

Partial least squares discriminant analysis

As correlations are also exploited in PLS regression, which aims to maximize the covariance between the response and the predictors, it is of interest to investigate how normalization may affect the results of PLS modeling. Here we address the simple case of PLS discriminant analysis between two groups of samples in a classical case–control scenario, which is one of the most frequent scenarios in metabolomics.28 Since the focus is on the effects of the normalization procedure, no further treatment (such as unit-variance or Pareto scaling) was applied after normalization prior to PLS-DA modeling, although in some cases this can be advisable; it should be noted, however, that VSN is a scaling method. Prediction results are given in Table 5 in terms of Q2, DQ2, the number of misclassifications (NMC), and the AUROC.29 The results are striking: normalization greatly affects the performance of the PLS model; (almost) perfect separation can be obtained for Spline, VSN, Quantile, Contrast, and Loess normalization, while the application of the other methods sharply decreases the quality of the classification model. In the review by Kohl et al.,5 the effect of normalization was also investigated on a support vector machine classification model: in that case (and with different data), the results obtained were more consistent and the differences between the models were less pronounced (see Table 2 in the abovementioned paper) than those obtained here for PLS-DA.

Which normalization method should be used?

Given a biological problem, the interest usually lies in inferring the characteristics of the correlation matrix underlying the problem (i.e. at the population level) starting from the measured samples. In this light, a legitimate question is which normalization method should be applied to obtain the most accurate estimation of the real underlying correlation structure. The performance of some of the methods presented here has been investigated on the basis of their ability to retrieve known fold changes in metabolite abundances in designed experiments in which several compounds of known concentration were spiked into urine specimens with different levels of dilution.5, 9, 13 While the real concentrations are known in such designs, there is no pre-imposed (a priori known) correlation structure among the different metabolites, since spiking is performed in a univariate fashion; this hampers the use of such designed data to investigate how normalization affects correlation patterns and reflects the difficulty of generating experimental data with a known covariance/correlation structure. The problem can be partially overcome by in silico simulation of data with known distributional properties. However, simulations can only partially reproduce the complex patterns of (non-homogeneous) variance/covariance and the (usually) unknown distributional properties of metabolomic data.28

Estimation of correlation patterns is, in general, also affected by the sample size and by the experimental noise present in the data. Since the focus here is on normalization, these parameters were not explored: the sample size was set to mimic a small/medium metabolomic experiment (for an overview of the effect of sample size on covariance and network estimation, see references 41 and 42) and the noise variance was kept low (see "Materials and Methods"); the influence of (correlated) experimental noise on correlation patterns is investigated in references 7 and 43.

We generated a data set with a known population correlation matrix (see "Materials and Methods") consisting of 81 variables observed in 150 observations, to which a dilution factor was applied. The data were then normalized and the correlation matrix was calculated. Figure 12 shows the scatter plot of the correlation values observed in the normalized data against the known correlations imposed a priori. The first thing to be noted is that all methods induce spurious correlation: the variables in the data set have correlations (at the population level) ranging from 1 to 0, and in the case of zero population correlation the observed correlation after normalization ranges between −0.5 and 0.5 for almost all methods, a clear indication that spurious correlations are induced among uncorrelated variables. The situation is particularly severe for IS and TSA, and for the latter this was also noted in a study in the closely related field of statistical total correlation spectroscopy.44 In general, methods like Loess, Li–Wong, and Contrast normalization perform poorly, at least on this example (suboptimal performance of the Li–Wong approach was also observed in the paper by Hochrein et al.9 in the case of no or small metabolic dysregulation, but this method performed well in the case of reverse engineering from microarray data16).


The poor performance of the Li–Wong approach may be explained by the fact that it looks for a subset of quasi rank-invariant features to build the normalization curve. This may not be a problem in transcriptomics, where thousands of genes are measured, but may be critical when a smaller number of features is considered. The suboptimal performance of Loess and Contrast normalization (also methods native to transcriptomics) could likewise be related to the problem of fitting the normalization curve with a limited number of data points. The other methods perform reasonably well and allow recovery of the original correlation pattern in the case of medium or high (original) correlation magnitude.

The results for normalization to urine output again require attention: this approach requires the measurement of a physiological parameter, in contrast to the other methods, whose normalization parameters are numerically estimated from the data. In this simulation, we considered the urine output to be directly proportional to the dilution factor applied when generating the data; under this assumption, UO normalization performs well on these data. Recently, model-based approaches have also been suggested and have shown good performance when the requirement is to normalize designed data or in the presence of batch effects.15 In reference 17, the authors performed UO normalization before network reconstruction using WGCNA,2 a choice that we support. Indeed, in the case of urine samples, normalization to urine output seems to be a sensible choice, although these measurements may be difficult to obtain.
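The type of simulation described above can be sketched schematically as follows: data with a known correlation structure are generated, a sample-specific dilution is applied, the data are normalized (TSA shown here), and the estimated correlations are compared with the true ones. The dimensions, the AR(1)-type correlation matrix, and the dilution distribution in this sketch are illustrative and not the actual design used in the paper.

```r
library(MASS)                                       # for mvrnorm()
set.seed(123)
p <- 20; n <- 150
Sigma <- 0.7 ^ abs(outer(1:p, 1:p, "-"))            # known population correlation matrix
X <- mvrnorm(n, mu = rep(10, p), Sigma = Sigma)     # "true" concentrations (positive)
dilution <- runif(n, 0.5, 2)                        # sample-specific dilution factor
Xobs <- X * dilution                                # measured (diluted) data
Xtsa <- Xobs / rowSums(Xobs)                        # TSA normalization
plot(Sigma[upper.tri(Sigma)], cor(Xtsa)[upper.tri(Sigma)],
     xlab = "true correlation", ylab = "estimated after TSA")  # analogue of Figure 12
```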

OVERALL DISCUSSION

We have documented the impact of the normalization procedure on the correlation structure of experimental data. Advantages and disadvantages of these normalization approaches have been discussed in references 6-8, but not from a correlation perspective or in the frame of network inference. It was found that the Contrast, Quantile, Li–Wong, and Spline normalization approaches perform better than no normalization for the detection of differentially expressed metabolites, with Quantile normalization performing better in a classification context,5 while UO normalization may cause putative biomarkers to decrease. However, we found problematic behavior of the Li–Wong algorithm on both real and simulated data. Working on biomarker selection in a chromatographic context, several authors advised against the use of TSA,7, 44 which we also found to affect the correlations (see Table 3), and suggested the use of PQN. Recently, the effect of TSA and IS normalization was also discussed in the frame of compositional data analysis in the PARAFAC context,45 and it was shown that, when centered log-ratio (clr) coordinates46 are applied, both normalizations lead to the same clr coordinates.

MATHEMATICAL RELATIONSHIP BETWEEN CORRELATIONS IN RAW AND NORMALIZED DATA


The standard correlation coefficient (Pearson's ρ, using the moment formula) between two variables x and y is defined as

\rho_{xy} = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}}    (14)

where var and cov denote the variance and covariance, respectively. This formula defines correlations as standardized covariances lying in the interval [–1, 1]. In practical applications, the covariance and the variances of x and y are replaced by their sample estimates, which may differ from the actual population values owing to limited sample size or correlated measurement noise.

If one considers some functions f(x) and f(y), it is in general not possible (or it is extremely cumbersome) to express ρ_{f(x)f(y)} as a function of the variances and covariance of the original variables. However, this is possible if the functions f(x) = x·u and f(y) = y·v are considered: if u = v = z, this is one of the cases illustrated in the numerical examples. The problem of defining the variance and the covariance of a product of random variables has been addressed in the past and is relevant in the regression context to describe interaction effects.47, 48 For our purpose, we start by considering the covariance of the products x·z and y·z, which can be derived from Equation (11) in reference 48. If the three variables are multivariate normally distributed, cov(xz, yz) can be expressed as

\operatorname{cov}(xz, yz) = E(x)E(y)\operatorname{cov}(z,z) + E(x)E(z)\operatorname{cov}(z,y) + E(z)E(y)\operatorname{cov}(x,z)
                           + E(z)E(z)\operatorname{cov}(x,y) + \operatorname{cov}(x,y)\operatorname{cov}(z,z) + \operatorname{cov}(x,z)\operatorname{cov}(z,y)    (15)

where E(·) indicates the expected value. If the three variables are mutually independent, their pairwise covariances are null; remembering that cov(z,z) = var(z), Equation (15) simplifies to

\operatorname{cov}(xz, yz) = E(x)E(y)\operatorname{cov}(z,z) = E(x)E(y)\operatorname{var}(z)    (16)

This indicates that even if x and y are independent (and thus ρ_xy = 0, since cov(x, y) = 0), the correlation between xz and yz is, in general, not zero; in fact ρ_{xz,yz} = ρ_xy = 0 only if var(z) = 0, which happens if z is constant. Indeed, multiplying by a positive constant does not affect the correlation between variables. More precisely, cov(xz, yz) = 0 if the pair (x, y) is expectation- and covariance-independent of z.48

Given the definition of the correlation coefficient, the variances of the products xz and yz also need to be considered. As shown in references 47 and 48, the variance of a product is a function of several quantities (see Equation (5) in the latter reference), including the original variances and a term cov(x, z); when x and z are independent (uncorrelated), it reduces to

\operatorname{var}(xz) = E^{2}(x)\operatorname{var}(z) + E^{2}(z)\operatorname{var}(x) + \operatorname{var}(x)\operatorname{var}(z)    (17)

Equation (17) and the analogous expression for the product yz can then be plugged into Equation (14), together with Equation (16), to obtain the correlation coefficient of the products of independent multivariate normal random variables, which will, in general, be different from that of the original variables x and y. This example reflects what happens when samples are normalized by multiplying them by an external variable z, as in the case of normalization to total urine output (see "Material and Methods" for more details). The treatment of the correlation of x/z and y/z (which describes normalization to an internal standard such as creatinine) is even more complicated, as it involves the computation of quantities such as var(x/z) and E(x/z), which must be estimated by means of approximations. However, Pearson derived approximate expressions for the correlation between the variables x/z and y/z.10
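The algebra above can be checked numerically. The Monte Carlo sketch below (illustrative Python, not code from the paper; all parameter values are arbitrary choices) draws independent normal x, y, and z with nonzero means, so that corr(x, y) = 0, and compares the empirical correlation of the products xz and yz with the value obtained by plugging Equations (16) and (17) into Equation (14).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Independent variables: cov(x, y) = cov(x, z) = cov(y, z) = 0
mx, my, mz = 2.0, 3.0, 1.5          # nonzero means (illustrative)
sx, sy, sz = 1.0, 1.0, 0.5          # standard deviations (illustrative)
x = rng.normal(mx, sx, n)
y = rng.normal(my, sy, n)
z = rng.normal(mz, sz, n)

# Empirical correlation of the products xz and yz
r_emp = np.corrcoef(x * z, y * z)[0, 1]

# Prediction from Eq. (16), cov(xz, yz) = E(x)E(y)var(z), and Eq. (17),
# var(xz) = E(x)^2 var(z) + E(z)^2 var(x) + var(x)var(z) (likewise for yz)
cov_pred = mx * my * sz**2
var_xz = mx**2 * sz**2 + mz**2 * sx**2 + sx**2 * sz**2
var_yz = my**2 * sz**2 + mz**2 * sy**2 + sy**2 * sz**2
r_pred = cov_pred / np.sqrt(var_xz * var_yz)

print(f"empirical corr(xz, yz): {r_emp:.3f}")
print(f"predicted corr(xz, yz): {r_pred:.3f}")   # nonzero although corr(x, y) = 0
```

With the values above the predicted correlation is roughly 0.37, illustrating how multiplication by a common, non-constant variable z induces correlation between otherwise independent variables.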

CONCLUSIONS

We have shown that the way we look at correlations in data, and how we interpret them, is strongly affected by the normalization approach applied. Since this step is fundamental and must be applied to compensate for sample-to-sample variation, the problem of selecting a normalization procedure becomes crucial. It is not possible to provide a recipe that fits all situations, as the most suitable approach depends on the data at hand and the kind of analysis that is going to be performed. If the focus is on metabolite–metabolite associations, it is necessary to understand which kind of variation is conserved or destroyed by the different normalization methods, because this will dictate the correlation pattern observed. This may not be an easy task; in general, it is not possible to theoretically derive the correlations of normalized data starting from the measured raw data. Even when the normalization method can be expressed as a simple mathematical relationship, as in the case of normalization to urine output or to an internal standard, the calculations become cumbersome. It may be advisable to perform the analysis on data normalized with different strategies and to compare the results with those obtained on the raw data: the resulting correlation matrices can then be modeled in a framework for the modeling of covariances/correlations like COVSCA27 to highlight differences and similarities and to make a choice that is biologically driven, favoring interpretation. For instance, in the context of a differential network analysis, where the focus is on investigating differences between networks pertaining to different pathophysiological statuses (see for instance reference 3), one can favor normalization methods that enhance such differences.
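As a minimal sketch of this strategy (assuming toy data and two simple stand-in normalizations; COVSCA itself is not implemented here, and a plain Frobenius distance is used instead), the following Python snippet computes the correlation matrix of the same data under several normalizations and quantifies how much the matrices differ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.abs(rng.normal(10.0, 1.0, size=(150, 20)))    # positive toy data (illustrative)

# A few simple row-wise normalizations (illustrative stand-ins, not the paper's full set of 11)
normalizations = {
    "none": lambda A: A,
    "TSA-like": lambda A: A / A.sum(axis=1, keepdims=True),
    "PQN-like": lambda A: A / np.median(A / np.median(A, axis=0), axis=1, keepdims=True),
}

# Correlation matrix of the data after each normalization
corr = {name: np.corrcoef(f(X), rowvar=False) for name, f in normalizations.items()}

# Pairwise dissimilarity of the correlation matrices (Frobenius norm of the difference);
# a covariance-modeling framework such as COVSCA could be used here instead.
names = list(corr)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        d = np.linalg.norm(corr[a] - corr[b])
        print(f"{a} vs {b}: Frobenius distance = {d:.2f}")
```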


ACKNOWLEDGMENTS

This work was partially supported by the European Commission-funded FP7 project INFECT (Contract No. 305280).


REFERENCES

(1) Steuer, R., Review: On the analysis and interpretation of correlations in metabolomic data Briefings in Bioinformatics 2006, 7, 151-158.
(2) Zhang, B.; Horvath, S., A General Framework for Weighted Gene Co-Expression Network Analysis Statistical Applications in Genetics and Molecular Biology 2005, 4.
(3) Saccenti, E.; Suarez-Diez, M.; Luchinat, C.; Santucci, C.; Tenori, L., Probabilistic Networks of Blood Metabolites in Healthy Subjects As Indicators of Latent Cardiovascular Risk J. Proteome Res 2014, 14, 1101-1111.
(4) Tsuchiya, Y.; Takahashi, Y.; Jindo, T.; Furuhama, K.; Suzuki, K. T., Comprehensive evaluation of canine renal papillary necrosis induced by nefiracetam, a neurotransmission enhancer Eur. J. Pharmacol. 2003, 475, 119-128.
(5) Kohl, S. M.; Klein, M. S.; Hochrein, J.; Oefner, P. J.; Spang, R.; Gronwald, W., State-of-the-art data normalization methods improve NMR-based metabolomic analysis Metabolomics 2012, 8, 146-160.
(6) Warrack, B. M.; Hnatyshyn, S.; Ott, K.-H.; Reily, M. D.; Sanders, M.; Zhang, H.; Drexler, D. M., Normalization strategies for metabonomic analysis of urine samples J. Chromatogr. B 2009, 877, 547-552.
(7) Filzmoser, P.; Walczak, B., What can go wrong at the data normalization step for identification of biomarkers? J. Chromatogr. 2014, 1362, 194-205.
(8) Lusczek, E. R.; Nelson, T.; Lexcen, D.; Witowski, N. E.; Mulier, K. E.; Beilman, G., Urine metabolomics in hemorrhagic shock: Normalization of urine in the face of changing intravascular fluid volume and perturbations in metabolism Journal of Bioanalysis & Biomedicine 2012, 2011.
(9) Hochrein, J.; Zacharias, H. U.; Taruttis, F.; Samol, C.; Engelmann, J. C.; Spang, R.; Oefner, P. J.; Gronwald, W., Data Normalization of 1H NMR Metabolite Fingerprinting Data Sets in the Presence of Unbalanced Metabolite Regulation J. Proteome Res 2015, 14, 3217-3228.
(10) Pearson, K., Mathematical Contributions to the Theory of Evolution.--On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs Proceedings of the Royal Society of London 1896, 60, 489-498.
(11) Zhang, S.; Zheng, C.; Lanza, I. R.; Nair, K. S.; Raftery, D.; Vitek, O., Interdependence of signal processing and analysis of urine 1H NMR spectra for metabolic profiling Analytical Chemistry 2009, 81, 6080-6088.
(12) Van Den Berg, R. A.; Hoefsloot, H. C. J.; Westerhuis, J. A.; Smilde, A. K.; Van Der Werf, M. J., Centering, scaling, and transformations: improving the biological information content of metabolomics data BMC Genomics 2006, 7, 142.
(13) Torgrip, R.; Åberg, K.; Alm, E.; Schuppe-Koistinen, I.; Lindberg, J., A note on normalization of biofluid 1D 1H-NMR data Metabolomics 2008, 4, 114-121.
(14) Sysi-Aho, M.; Katajamaa, M.; Yetukuri, L.; Orešič, M., Normalization method for metabolomics data using optimal selection of multiple internal standards BMC Bioinformatics 2007, 8, 1-17.
(15) Jauhiainen, A.; Madhu, B.; Narita, M.; Narita, M.; Griffiths, J.; Tavaré, S., Normalization of metabolomics data with applications to correlation maps Bioinformatics 2014, 30, 2155-2161.
(16) Lim, W. K.; Wang, K.; Lefebvre, C.; Califano, A., Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks Bioinformatics 2007, 23, i282-i288.


(17) Lusczek, E.; Lexcen, D.; Witowski, N.; Mulier, K.; Beilman, G., Urinary metabolic network analysis in trauma, hemorrhagic shock, and resuscitation Metabolomics 2013, 9, 223-235.
(18) Jatlow, P.; McKee, S.; O'Malley, S. S., Correction of urine cotinine concentrations for creatinine excretion: is it useful? Clin. Chem. 2003, 49, 1932-1934.
(19) Dieterle, F.; Ross, A.; Schlotterbeck, G.; Senn, H., Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics Analytical Chemistry 2006, 78, 4281-4290.
(20) Dudoit, S.; Yang, Y. H.; Callow, M. J.; Speed, T. P., Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments Statistica Sinica 2002, 111-139.
(21) Bolstad, B. M.; Irizarry, R. A.; Åstrand, M.; Speed, T. P., A comparison of normalization methods for high density oligonucleotide array data based on variance and bias Bioinformatics 2003, 19, 185-193.
(22) Cleveland, W. S.; Devlin, S. J., Locally weighted regression: an approach to regression analysis by local fitting Journal of the American Statistical Association 1988, 83, 596-610.
(23) Åstrand, M., Contrast normalization of oligonucleotide arrays J. Comput. Biol. 2003, 10, 95-102.
(24) Li, C.; Wong, W. H., Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application Genome Biol 2001, 2, 1-11.
(25) Workman, C.; Jensen, L. J.; Jarmer, H.; Berka, R.; Gautier, L.; Nielsen, H. B.; Saxild, H.-H.; Nielsen, C.; Brunak, S.; Knudsen, S., A new non-linear normalization method for reducing variability in DNA microarray experiments Genome Biol 2002, 3, 1-16.
(26) Huber, W.; Von Heydebreck, A.; Sültmann, H.; Poustka, A.; Vingron, M., Variance stabilization applied to microarray data calibration and to the quantification of differential expression Bioinformatics 2002, 18, S96-S104.
(27) Smilde, A. K.; Timmerman, M. E.; Saccenti, E.; Jansen, J. J.; Hoefsloot, H. C. J., Covariances Simultaneous Component Analysis: a new method within a framework for modeling covariances J. Chemometrics 2015, 29, 277-288.
(28) Saccenti, E.; Hoefsloot, H. C.; Smilde, A. K.; Westerhuis, J. A.; Hendriks, M. M., Reflections on univariate and multivariate analysis of metabolomics data Metabolomics 2014, 10, 361-374.
(29) Szymańska, E.; Saccenti, E.; Smilde, A. K.; Westerhuis, J. A., Double-check: validation of diagnostic statistics for PLS-DA models in metabolomics studies Metabolomics 2012, 8, 3-16.
(30) Westerhuis, J. A.; Hoefsloot, H. C. J.; Smit, S.; Vis, D. J.; Smilde, A. K.; van Velzen, E. J. J.; van Duijnhoven, J. P. M.; van Dorsten, F. A., Assessment of PLSDA cross validation Metabolomics 2008, 4, 81-89.
(31) Glorfeld, L. W., An Improvement on Horn's Parallel Analysis Methodology for Selecting the Correct Number of Factors to Retain Educational and Psychological Measurement 1995, 55, 377-393.
(32) Horn, J. L., A rationale and test for the number of factors in factor analysis Psychometrika 1965, 30, 179-185.
(33) R Core Team, R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. ISBN 3-900051-07-0.
(34) Camacho, D.; de la Fuente, A.; Mendes, P., The origin of correlations in metabolomics data Metabolomics 2005, 1, 53-63.


(35) Hauke, J.; Kossowski, T., Comparison of values of Pearson's and Spearman's correlation coefficients on the same sets of data Quaestiones Geographicae 2011, 30, 87-93.
(36) Madhu, B.; Narita, M.; Jauhiainen, A.; Menon, S.; Stubbs, M.; Tavaré, S.; Narita, M.; Griffiths, J. R., Metabolomic changes during cellular transformation monitored by metabolite–metabolite correlation analysis and correlated with gene expression Metabolomics 2015, 11, 1848-1863.
(37) Margolin, A. A.; Nemenman, I.; Basso, K.; Wiggins, C.; Stolovitzky, G.; Favera, R. D.; Califano, A., ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context BMC Bioinformatics 2006, 7, S7-S7.
(38) Faith, J. J.; Hayete, B.; Thaden, J. T.; Mogno, I.; Wierzbowski, J.; Cottarel, G.; Kasif, S.; Collins, J. J.; Gardner, T. S., Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles PLoS Biol. 2007, 5, e8.
(39) Villaverde, A. F.; Ross, J.; Morán, F.; Banga, J. R., MIDER: Network Inference with Mutual Information Distance and Entropy Reduction PLoS ONE 2014, 9, e96732.
(40) Song, L.; Langfelder, P.; Horvath, S., Comparison of co-expression measures: mutual information, correlation, and model based indices BMC Bioinformatics 2012, 13, 1-21.
(41) Saccenti, E.; Timmerman, M. E., Approaches to Sample Size Determination for Multivariate Data: Applications to PCA and PLS-DA of Omics Data J. Proteome Res 2016, 15, 2379-2393.
(42) Suarez-Diez, M.; Saccenti, E., Effects of sample size and dimensionality on the performance of four algorithms for inference of association networks in metabonomics J. Proteome Res 2015.
(43) Kaduk, M.; Hoefsloot, H. C. J.; Vis, D. J.; Reijmers, T.; van der Greef, J.; Smilde, A. K.; Hendriks, M. M. W. B., Correlated measurement error hampers association network inference J. Chromatogr. B 2014, 966, 93-99.
(44) Craig, A.; Cloarec, O.; Holmes, E.; Nicholson, J. K.; Lindon, J. C., Scaling and normalization effects in NMR spectroscopic metabonomic data sets Analytical Chemistry 2006, 78, 2262-2267.
(45) Gardlo, A.; Smilde, A. K.; Hron, K.; Hrdá, M.; Karlíková, R.; Friedecký, D.; Adam, T., Normalization techniques for PARAFAC modeling of urine metabolomic data Metabolomics 2016, 12, 1-13.
(46) Aitchison, J., The statistical analysis of compositional data 1986.
(47) Goodman, L. A., On the exact variance of products Journal of the American Statistical Association 1960, 55, 708-713.
(48) Bohrnstedt, G. W.; Goldberger, A. S., On the exact covariance of products of random variables Journal of the American Statistical Association 1969, 64, 1439-1442.


TABLES

Table 1 List of normalization methods investigated in the present study, together with the abbreviations used

 #   Method                                 Abbreviation   Field                        Reference
 1   Normalization to internal standard     IS             Biochemistry, Metabolomics   18
 2   Normalization to urine output          UO             Metabolomics                 17
 3   Normalization to total spectral area   TSA            Metabolomics                 17
 4   Probabilistic quotient normalization   PQN            Metabolomics                 19
 5   Cyclic Loess normalization             Loess          Transcriptomics              20, 22
 6   Contrast normalization                 CN             Transcriptomics              23
 7   Quantile normalization                 QN             Transcriptomics              21
 8   Linear baseline normalization          Linear         Transcriptomics              21
 9   Non-linear baseline normalization      Li–Wong        Transcriptomics              24
10   Cubic-spline normalization             CSN            Transcriptomics              25
11   Variance stabilizing normalization     VSN            Transcriptomics              26


Table 2 Pairwise comparisons for which the distributions of the magnitude of the correlations obtained after normalization were found not to differ significantly. All other differences were significant at the 0.05 level after Bonferroni correction for multiple comparisons.

Comparison   Method 1   Method 2
 1           Contrast   Li–Wong
 2           Linear     None
 3           Loess      VSN
 4           Loess      Quantile
 5           Loess      Spline
 6           PQN        VSN
 7           PQN        Loess
 8           PQN        Quantile
 9           PQN        Spline
10           Quantile   Spline
11           Quantile   VSN
12           Spline     VSN
13           TSA        Linear
14           TSA        None


Table 3 Correlations observed between selected metabolite pairs in the raw and normalized data, by normalization method

Metabolite A    Metabolite B              Correlation in raw data   Method     Correlation in normalized data
Tyrosine        Formate                   0.08                      Loess      –0.86
Tyrosine        Formate                   0.08                      TSA        –0.80
Trigonelline    Cytosine                  0.25                      Contrast   –0.77
Niacinamide     1,6-Anhydro-β-D-glucose   0.01                      Spline     –0.72


Table 4 Observed rejection rate of the null hypothesis that the data do not contain any correlation structure other than that expected by chance; the observed rejection rate should be contrasted with the theoretically expected value of 0.05.

 #   Method     Rejection rate
 1   TSA        1
 2   IS         1
 3   UO         1
 4   PQN        0.06
 5   Loess      1
 6   Contrast   1
 7   Quantile   0.14
 8   Linear     1
 9   Li–Wong    1
10   Spline     0.12
11   VSN        0.08
12   None       0.05

Table 5 Quality parameters for PLS-DA models built after applying the different normalization procedures to the data (simulated case-control study)

 #   Method     NMC   Q2      DQ2     AUROC
 1   TSA        64    0.02    0.03    0.59
 2   IS         77    –0.34   0.01    0.39
 3   UO         57    –0.05   0.01    0.61
 4   PQN        62    –0.05   –0.03   0.64
 5   Loess      1     0.89    0.92    1.00
 6   Contrast   23    0.41    0.56    0.99
 7   Quantile   7     0.63    0.81    0.99
 8   Linear     86    –0.69   –0.02   0.28
 9   Li–Wong    64    –0.25   –0.03   0.56
10   Spline     7     0.36    0.72    0.99
11   VSN        0     0.95    0.97    1.00
12   None       33    0.21    0.25    0.77


FIGURE LEGENDS

Figure 1 Correlation patterns observed in the functions of two random variables as a function of the sample size. Here variables are lognormally distributed to avoid the problem of division by zero in the example corr(x/z, y/z). For each sample size, 1000 instances of x, y, and z are realized. The central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. Outliers [values greater than q3 + w × (q3 – q1) or less than q1 – w × (q3 – q1), where q1 and q3 are the 25th and 75th percentiles of the sample data, respectively] are indicated by open circles.

Figure 2 Scatter plots of the correlation values (raw data versus creatinine normalized) of a subset of metabolites (out of the 62 measured in the experimental data set) for which the normalization to creatinine greatly affects the magnitude (Panel A) and the sign (Panel B) of the correlation. The two subsets are different; hence the different numbers of dots (i.e. metabolites) in the two panels.

Figure 3 COVSCA modeling of the correlation matrices obtained by applying different normalization procedures on a metabolomics experimental data set. Note that the points representing the TSA and Linear normalized correlation matrices overlap.

Figure 4 A) Percentage (%) of high, medium, and low correlations calculated after the different normalization methods have been applied to the data. B) Percentage (%) of statistically significant correlations calculated after the different normalization methods have been applied to the data. C) Percentage (%) of positive and negative correlations calculated after the different normalization methods have been applied to the data. D) Percentage (%) of correlations that change sign (with respect to raw data) after normalization methods have been applied to the data.

Figure 5 Percentage (%) of high, low, and medium correlations that change sign upon normalization of the data.

Figure 6 Distribution of the average correlation observed in the 1000 realizations of permuted experimental data after normalization has been applied.

Figure 7 COVSCA modeling of the network adjacency matrices, calculated using Equation (6) from the correlation matrix of different normalized data.

Figure 8 A) Node degree of cytosine from the networks estimated from data normalized with different approaches. B) Scatter plot of the node degree of the 61 metabolites: comparison of results obtained from TSA and UO (blue) and Loess and Li–Wong (red).

Figure 9 Circular representation of the topology of the network obtained using TSA (Panel A) and Li–Wong (Panel B) normalized data. Networks are obtained using the PCLRC algorithm.


Figure 10 Loadings for the first principal components for a PCA model fitted on the data normalized with the different procedures. Data are Pareto scaled before PCA. Variable #23 is glucose, #25 is hippurate, and #57 is glucose.

Figure 11 Score plots of the first two principal components for a PCA model fitted on the data normalized with the different procedures. Data are Pareto scaled before PCA. Variable #23 is glucose, #25 is hippurate, and #57 is glucose.
