Visualization, Quantification, and Alignment of Spectral Drift in

Jan 19, 2017 - Analysis of spectral data across population scale cohorts, however, ..... This approach not only allows for global visualization of dat...
0 downloads 0 Views 6MB Size
Subscriber access provided by UNIV OF CALIFORNIA SAN DIEGO LIBRARIES

Letter

Visualization, Quantification and Alignment of Spectral Drift in Population Scale Untargeted Metabolomics Data Jeramie D. Watrous, Mir Henglin, Brian Claggett, Kim Lehmann, Martin G. Larson, Susan Cheng, and Mohit Jain Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.6b04337 • Publication Date (Web): 19 Jan 2017 Downloaded from http://pubs.acs.org on January 24, 2017

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 17

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11

Visualization, Quantification and Alignment of Spectral Drift in Population Scale Untargeted Metabolomics Data Jeramie D. Watrous1†, Mir Henglin2†, Brian Claggett2, Kim Lehmann1, Martin G. Larson3,4, Susan Cheng2, 3‡*, and Mohit Jain1‡*

†: denotes equal contribution ‡: denote equal contribution *: denotes corresponding authors

12 13 14

Affiliations:

15

1

16 17

2

Cardiovascular Division, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, 02115

18

3

Framingham Heart Study, Framingham, MA, 01702

19

4

Biostatistics Department, School of Public Health, Boston University, Boston, MA, 02118

Departments of Medicine & Pharmacology, University of California San Diego, La Jolla, CA, 92093

20 21

For Correspondence:

22 23 24 25 26 27 28

Mohit Jain MD PhD University of California, San Diego mjain@ucsd.edu Susan Cheng MD MPH Brigham and Women’s Hospital, Harvard Medical School scheng@rics.bwh.harvard.edu

1 ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

Abstract

2

Untargeted liquid-chromatography-mass spectrometry (LC-MS) based metabolomics analysis of human

3

biospecimens has become among the most promising strategies for probing the underpinnings of human

4

health and disease. Analysis of spectral data across population scale cohorts, however, is precluded by

5

day-to-day non-linear signal drifts in LC retention time or batch effects that complicate comparison of

6

thousands of untargeted peaks. To date, there exists no efficient means of visualization and quantitative

7

assessment of signal drift, correction of drift when present, and automated filtering of unstable spectral

8

features, particularly across thousands of data files in population scale experiments. Herein, we report the

9

development of a set of R-based scripts that allow for pre- and post-processing of raw LC-MS data. These

10

methods can be integrated with existing data analysis workflows by providing initial pre-processing bulk

11

non-linear retention time correction at the raw data level. Further this approach provides post-processing

12

visualization and quantification of peak alignment accuracy, as well as peak-reliability based parsing of

13

processed data through hierarchical clustering of signal profiles. In a metabolomics dataset derived from

14

~3000 human plasma samples, we find that application of our alignment tools resulted in substantial

15

improvement in peak alignment accuracy, automated data filtering, and ultimately statistical power for

16

detection of metabolite correlates of clinical measures. These tools will enable metabolomics studies of

17

population scale cohorts.

18 19

Introduction

20

Metabolomics is among the most promising strategies for deep phenotyping of complex biological

21

samples.1-3 When applied to human biospecimens, particularly across population scale cohorts of

22

thousands of individuals, metabolomics allows for identification of biomarkers years prior to onset of

23

disease, understanding of disease pathophysiology, elucidation of mechanisms of action for

24

pharmacologic agents, and even a means for personalization of therapy.1-11 Continual advances in high

25

resolution mass spectrometry coupled to liquid chromatography separation (LC-MS) enable routine

26

separation and measurement of thousands of discrete chemical features from biospecimens in an

27

‘untargeted’ fashion without the need for chemical labeling, enabling broad and comprehensive coverage

28

of the metabolome.12-15 As is the case with any analytical approach, even the most well controlled LC-MS

29

methods will experience fluctuations (‘signal drift’) during constant measurement of thousands of unique

30

samples over extended periods of continuous analysis time.16-20 It is therefore essential that major sources

31

of drift, including changes instrument sensitivity, chromatographic retention time, and sample extraction

32

efficiency, be accounted for during analysis to allow for comparison of individual spectral peaks across

33

all samples, to limit errors in signal processing, and to avoid introduction of systematic artifacts.18,21 The 2 ACS Paragon Plus Environment

Page 2 of 17

Page 3 of 17

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

1

major source of signal drift over an extended LC-MS run arises from non-linear fluctuations in LC

2

retention time, resulting from column aging, slight variability in mobile phase preparations, and

3

differences in sample matrix complexity, all of which can influence a given metabolites interaction with

4

the LC column’s stationary phase.19,22-25 Similarly, with untargeted metabolomics, a certain subpopulation

5

of metabolites will inevitably be chemically unstable under any given method parameters, resulting in

6

rapid time-dependent changes in signal intensity16,20,21 as well as irregularity in sample recovery during

7

preparation (i.e. batch effects).17,26 Together, these signal drifts are amplified in population scale

8

untargeted metabolomics studies across thousands of samples. Accurate extraction and interpretation of

9

population scale LC-MS data requires robust, automated methods for detection, visualization and

10

quantification of signal drift.

11

Given the ubiquitous nature of these limitations in LC-MS metabolomics experiments, significant effort

12

has been put forth in creating software for handling signal normalization due to instrument drift26-29 and

13

batch effects17,30-32 as well as creating QC metrics for evaluating LC-MS stability and reliability.18,20,21,33

14

Similarly, prior efforts have also reported approaches for addressing non-linear retention time

15

correction.19,22-24 While most of these tools provide valuable capabilities in data processing, we often find

16

they are not able to be easily incorporated with established analysis pipelines. More importantly, existing

17

methods are generally not capable of processing thousands of LC-MS files in a single analysis, a

18

requirement for population scale metabolomic studies. In addition, many reported approaches also require

19

the use of vendor specific data formats or abstract file conversions which severely limit options for further

20

downstream data processing. To combine data handling tools into an integrated software pipeline, open

21

source analysis suites such as XCMS25, Mzmine34, OpenMS35 and SMART36 have been developed to

22

allow users the ability to perform non-linear peak alignment with downstream tools for peak picking and

23

normalization. However, a common shortcoming of many of these packages is their inability to perform

24

computationally intensive functions (e.g. as unsupervised non-linear retention time correction) on

25

thousands of data files in a single analysis as well as advanced data quality filtering such as the ability to

26

determine signal reliability based on a peaks cumulative signal profile.

27

To expand functionality and flexibility of data handling across thousands of untargeted LC-MS files, we

28

sought to create a suite of simple software modules that could be easily added to and incorporated with

29

existing analysis pipelines in order. The first module allows for initial pre-processing and bulk non-linear

30

retention time correction of raw spectral data using user indicated landmark peaks. Corrected data can

31

thereafter be imported into any user preferred compatible software with improvement in the quality of

32

secondary unsupervised non-linear alignment as well as allows for the use of less computationally

33

intensive linear peak alignment algorithms. The second module allows scoring of the quality of peak

3 ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

alignment, allowing rapid setting optimization and assessment of data quality. Lastly, the third module

2

allows for filtering of data based on signal reliability by performing hierarchical clustering on peak profile

3

patterns and sorting based on the type of drift (e.g. batch effects, chemical/thermal instability, and

4

misalignment). This allows for each subset of data to be normalized using available software specifically

5

designed for correcting that type of drift. These R-based scripts were tested and validated using a

6

population scale dataset comprised of 2895 human plasma samples, with substantial improvement in peak

7

alignment accuracy and automated data filtering, ultimately resulting in greater statistical power for

8

identifying correlates between observed metabolites and clinical measures.

9

These modules are provided to the metabolomics community to enable population based metabolomics

10

studies.

11 12

Results and Discussion

13

Pre-alignment of LC-MS Data

14

Analysis of LC-MS data is typically performed by detecting all masses within each scan, extracting each

15

mass individually and determining if chromatographic peaks are present within the mass trace (i.e. peak

16

picking), aligning corresponding peaks between samples (i.e. feature alignment), removing of unreliable

17

chromatographic features (typically through basic filtering as well as manual inspection), and lastly

18

statistical analysis with respect to phenotypes (Fig. 1A).37 While mass detection and peak picking

19

algorithms are relatively robust and consistent across most software suites, feature alignment and signal

20

filtering can differ greatly in terms of approach and capability. Accurate pairing of corresponding

21

chromatographic features among not only sequentially analyzed samples but also subsequent data sets

22

requires highly accurate modeling of complex non-linear retention time drift.19,22-24 Failure to account for

23

this drift may introduce systematic errors in signal processing such as spectral features becoming "split"

24

where their retention times have drifted outside of the allowable tolerance window causing the software to

25

treat one feature as multiple features (Fig 1B). Similarly, when neighboring isobaric peaks are present,

26

different features may become cross-aligned whereby signal from different compounds are treated as a

27

single feature (Fig 1B). Moreover, we have observed that most non-linear unsupervised alignment

28

algorithms do not offer sufficient flexibility in the alignment model to adjust for minor errors in final

29

alignment. We therefore sought to create a supervised, landmark-based, non-linear, pre-processing

30

alignment algorithm that allows for bulk retention time correction in order to improve the accuracy of

31

secondary alignment. To examine the extent of retention drift is present in ‘real world’ metabolomics

32

data, LC-MS data was collected from 2895 human plasma samples, prepared and analyzed as 33 batches

4 ACS Paragon Plus Environment

Page 4 of 17

Page 5 of 17

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

1

of 93 samples each. Absolute retention time drift for 24 isotopically labeled internal standards (Table S-

2

1) acting as known landmarks and 60 endogenous unknown landmark features (i.e. ubiquitous peaks

3

present in every sample with little variability) were plotted against run retention time for each batch to

4

reveal non-linear drift patterns with variability in these patterns among prepared batches (Fig 2A and S-

5

1). Given the complexity and size of data, low order polynomials do not offer sufficient degrees of

6

freedom to accurately account for drift across the entire run time and changes in this drift profile between

7

samples necessitate that each sample be corrected independently. We therefore employed running average

8

cubic smoothing splines as the foundation for building drift models with varying degrees of freedom (i.e.

9

4, 8, 16 and 32 degrees of freedom). Resulting models using either internal standard landmarks or

10

endogenous landmarks were highly comparable (data not shown) and therefore all landmarks were used

11

for analysis. Using these models, the scan retention times within each .mzXML raw data file were

12

modified to fit each model and were exported as new corrected .mzXML files, which could then be

13

imported into any compatible software suite. To test the accuracy and quality of alignment, a metric was

14

established for quantification of feature misalignment. Major shifts in retention time typically occur

15

between batches, and as such we searched for mass features that were statistically absent from within at

16

least one of the 33 batches. These batch-specific perturbations were visualized by plotting the percent

17

misalignment versus cumulative metabolite number and quantified by calculating the area under the curve

18

(AUC) for each trace (Fig 2B). Pre-aligned (“corrected”) and uncorrected, raw data were loaded in

19

Mzmine 2.21 and all spectra features above 50,000 counts (roughly 6,000 peaks) were aligned using the

20

Join Aligner linear alignment function.34 Quantitation of misalignment revealed that 82% of peaks were

21

misaligned within the uncorrected dataset, confirming the ubiquitous nature of alignment issues in

22

untargeted metabolomics data. In contrast, only 14% were misaligned in the corrected data using 16-

23

degrees of freedom. Figure 2C shows example extracted ion chromatograms for m/z 269.1782, a mass

24

that exhibits an abundance of isobaric species, from each of the 33 batches, both before and after retention

25

time correction is applied, with marked improvement following institution of the correction model. We

26

further examined whether landmark based pre-alignment improves more sophisticated, non-linear

27

alignment tools that are present in common processing platforms such as Mzmine (RANSAC algorithm34)

28

or XCMS online (http//xcmsonline.scripps.edu).25 Analysis of the entire metabolomics dataset of ~3000

29

files was not possible, however, given size and memory constraints present with both of these processing

30

platforms. A subset of 200 random metabolomics data files was therefore selected and examined.

31

Metabolomics data corrected using our landmark alignment pre-processing step improved data quality

32

and overall alignment, with fewer missing values with both XCMS (Fig S-2) and Mzmine RANSAC (Fig

33

S-3). One of the major advantages of performing retention time correction within the raw data before

34

peak deconvolution is that the end user is able to import the corrected data into their desired data analysis 5 ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

platform, regardless of the strength of its alignment algorithms. As such, pre-alignment of population

2

scale metabolomics data using a facile drift model may greatly improve data quality and reduce

3

systematic error due to improper signal processing.

4

Statistical Association of Aligned Metabolomics Data

5

Typically, analysis software equipped with simple linear or low order polynomial based alignment

6

algorithms require retention time limits to be set using a large enough time window to encompass all

7

possible drift within the raw data. This can drastically increase the risk of cross-alignment with

8

neighboring isobaric features. For population scale metabolomics analysis, this cross alignment can have

9

major implications when correlating spectral information with clinical outcomes and can lead to both

10

false positives as well as false negatives, undermining the overall analysis. To determine the extent to

11

which the optimal alignment algorithm impacts the results of metabolite associations assessed in a large

12

population scale cohort, corrected and uncorrected LC-MS data for 2895 patient plasma samples were

13

loaded into Mzmine 2.21 and features were aligned using a simple linear alignment algorithm with a

14

0.025 minute window for corrected data and a 0.15 minute window for the uncorrected data (as 0.15

15

minutes represents the maximum observed retention time drift in the raw data for the 2895 samples). We

16

then performed linear and logistic regression analyses to examine the associations of each metabolite with

17

the common clinical features of age and sex, respectively. Corrected raw data showed marked

18

improvement in association with clinical phenotype, especially with respect to the magnitude of

19

associations assessed by beta coefficients (Fig S-4A) and statistical significance of associations indicated

20

by P values (Fig S-4B). For detecting metabolites associated with age, a continuous outcome, a total of

21

13.8% of all metabolites measured reached the Bonferroni threshold of statistical significance when

22

models were run using the uncorrected data, and 18.0% when models were run using corrected data

23

(P=0.002). When compared to the results of models run using corrected data, the models run using

24

uncorrected data produced 8.9% false negative results and 27.5% false positive results. Similar degrees of

25

high false positive and false negative levels were observed with uncorrected metabolite association with

26

gender, a binary outcome (data not shown). We therefore find that proper pre-alignment of untargeted

27

metabolomics data substantially improves both sensitivity and specificity for detecting potentially

28

important metabolite associations in population scale metabolomics analyses.

29

Visualization and Hierarchical Clustering of Aligned Spectra Features

30

Even when properly integrated and aligned, signal drift may be introduced into untargeted metabolomics

31

datasets due to non-processing errors, including day-to-day variances in sample preparation (batch

32

effects), changes in instrument sensitivity, or even simple chemical instability and breakdown of unstable

33

metabolites during the time between extraction and LC-MS analysis.16-18,21,33 While methods are available 6 ACS Paragon Plus Environment

Page 6 of 17

Page 7 of 17

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

1

for unsupervised correction of signal drifts16,17,22,24, to the best of our knowledge there does not exist a

2

facile method for parsing out LC-MS signal profiles based on the type of drift they exhibit so that further

3

processing can be performed based on the type of drift or removed if the drift is not correctable. We

4

reasoned that metabolites subject to batch effects, instrument drift or chemical instability may

5

demonstrate predictable patterns across large scale metabolomics experiments (Fig 3A) and that

6

metabolites exhibiting similar patterns may be visualized and clustered to potentially identify and remove

7

unreliable signals from further analysis. Using data from the 2895 human plasma samples following non-

8

linear retention time correction, the mean signal intensity from the resulting ~6000 spectral features was

9

plotted with respect to run order. Trace patterns across run order were clustered and those metabolites that

10

demonstrated significant fluctuations within a batch or across batches were automatically grouped

11

together. Resulting metabolite clusters were then parsed into groups based on signal reliability with

12

reliable signals colored blue and signals exhibiting significant levels of deviation within their intensity

13

profiles colored red and labeled as unreliable, allowing for rapid visualization of untargeted metabolomics

14

data (Fig S-5). Figure 3B highlights several examples of such signals with profiles for stable metabolites

15

being relatively invariant from batch-to-batch, whereas misaligned peaks exhibited batch ‘drop out’, cross

16

aligned metabolites or those metabolites sensitive to batch to batch variance in sample preparation

17

exhibited clear inter-batch shifts, and unstable metabolites exhibited both intra- and inter-batch decrement

18

in signal. This approach not only allows for global visualization of data quality within an experiment, but

19

also enables unbiased automatic parsing of features into groups based on their signal "reliability".

20 21

Conclusions

22

Proper feature alignment and peak quality filtering in LC-MS data sets is critically important for correctly

23

analyzing and interpreting spectral information, particularly across untargeted metabolomics datasets

24

including those used for chemotyping of population scale clinical cohorts. We have shown that

25

performing bulk pre-processing retention time correction at the raw data level not only allows end users

26

freedom to utilize analysis software lacking robust alignment algorithms, but also provides better quality

27

data extraction when used prior to secondary processing with software suites such as XCMS or Mzmine.

28

In addition, the ability to visualize and quantitate the degree of misalignment within a data set before

29

further processing is a critical checkpoint within any analysis pipeline. Lastly, using hierarchical

30

clustering to categorize signals based on the shape of the signal intensity profile across batches allows for

31

visualization and parsing of data based on signal reliability. These approaches can enhance the quality of

32

LC-MS data analysis with improved chromatographic alignment, thereby allowing for increased accuracy

7 ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

in feature quantification, automated parsing of unreliable features, and greater downstream statistical

2

power.

3

This set of annotated R-based scripts is available for the metabolomics community at

4

http://www.jainlaboratory.org/tools/alignmentscripts.html, along with user documentation and details.

5 6

Associated Content

7

Supporting Information

8

The Supporting Information is available free of charge on the internet at http://pubs.acs.org.

9

Experimental methods, list of isotopically labeled internal standards, endogenous landmark peak

10

examples, XCMS and Mzmine output comparison for corrected versus uncorrected data, heatmaps from

11

statistical outcomes analysis, example output of feature clustering.

12 13

Acknowledgements

14

This work was supported by funding from the National Institutes of Health (R01 HL134168: SC, MJ), the

15

American Heart Association (CVGPS Pathway Award: SC, MJ), the Doris Duke Charitable Foundation

16

(MJ, SC), and the Tobacco-Related Disease Research Program (MJ, JDW).

17 18 19 20 21 22 23 24 25 26 27

8 ACS Paragon Plus Environment

Page 8 of 17

Page 9 of 17

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

Analytical Chemistry

Figures

2 3 4 5 6

Figure 1. Misalignment of spectral data. A) Typical LC-MS data analysis workflow. Steps encompassed in red box indicated steps where drift correction was applied. B) Possible outcomes of improper alignment. Left column shows proper alignment of the two peaks within the retention time window (dashed lines), whereas misaligned (center) and cross-aligned (right) peaks are shown along with their impact on final data outcomes.

7 8 9

9 ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1 2 3 4 5 6 7 8 9

Page 10 of 17

Figure 2. Quantification of signal drift. A) Modeling non-linear retention time drift across four representative sample preparation batches (N~90 samples per batch). The blue line indicates mean retention drift relative to a reference sample whereas the clusters of black points indicate individual retention times for the 24 internal standards and 60 landmark peaks. Drift models were generated for each of the 2895 samples and used to create corrected .mzXML files. B) Quantifying peak alignment quality by plotting the percent of misaligned features based on batch-to-batch consistency within randomized patient samples and calculating the area under the curve (AUC) with a higher AUC value indicating better alignment. C) Example overlay of EIC's from each of the 33 batches for m/z 269.1782, which displays many isobaric structures, before and after retention time correction.

10 ACS Paragon Plus Environment

Page 11 of 17

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1 2 3 4 5 6 7 8 9 10 11 12 13 14

Analytical Chemistry

Figure 3. Hierarchical clustering based filtering of unreliable signals. A) Examples of possible intensity patterns that can occur during a large scale LC-MS experiment. Red line indicates mean intensity value for a chromatographic feature across all batches. B) Examples of parsed clustering output showing stable features (consistent means across all batches), misaligned features (signal loss across one or more batches), cross-aligned and batch affected features (regular or erratic shifting of mean signal between batches), and chemically unstable features (gradual increase or decrease in signal over each batch or across entire experiment).

11 ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Page 12 of 17

REFERENCES

(1) Johnson, C. H.; Ivanisevic, J.; Siuzdak, G. Nat Rev Mol Cell Biol 2016, 17, 451-459. (2) Nicholson, J. K.; Holmes, E.; Kinross, J. M.; Darzi, A. W.; Takats, Z.; Lindon, J. C. Nature 2012, 491, 384-392. (3) Nagana Gowda, G. A.; Raftery, D. Curr Metabolomics 2013, 1, 227-240. (4) Cheng, S.; Larson, M. G.; McCabe, E. L.; Murabito, J. M.; Rhee, E. P.; Ho, J. E.; Jacques, P. F.; Ghorbani, A.; Magnusson, M.; Souza, A. L.; Deik, A. A.; Pierce, K. A.; Bullock, K.; O'Donnell, C. J.; Melander, O.; Clish, C. B.; Vasan, R. S.; Gerszten, R. E.; Wang, T. J. Nat Commun 2015, 6, 6791-6799. (5) Wang, T. J.; Larson, M. G.; Vasan, R. S.; Cheng, S.; Rhee, E. P.; McCabe, E.; Lewis, G. D.; Fox, C. S.; Jacques, P. F.; Fernandez, C.; O'Donnell, C. J.; Carr, S. A.; Mootha, V. K.; Florez, J. C.; Souza, A.; Melander, O.; Clish, C. B.; Gerszten, R. E. Nat Med 2011, 17, 448-453. (6) Kaddurah-Daouk, R.; Kristal, B. S.; Weinshilboum, R. M. Annu Rev Pharmacol Toxicol 2008, 48, 653-683. (7) Kaddurah-Daouk, R.; Weinshilboum, R. M. Clin Pharmacol Ther 2014, 95, 154-167. (8) Snowden, S.; Dahlen, S. E.; Wheelock, C. E. Bioanalysis 2012, 4, 2265-2290. (9) Suhre, K.; Shin, S. Y.; Petersen, A. K.; Mohney, R. P.; Meredith, D.; Wagele, B.; Altmaier, E.; Deloukas, P.; Erdmann, J.; Grundberg, E.; Hammond, C. J.; de Angelis, M. H.; Kastenmuller, G.; Kottgen, A.; Kronenberg, F.; Mangino, M.; Meisinger, C.; Meitinger, T.; Mewes, H. W.; Milburn, M. V.; Prehn, C.; Raffler, J.; Ried, J. S.; Romisch-Margl, W.; Samani, N. J.; Small, K. S.; Wichmann, H. E.; Zhai, G.; Illig, T.; Spector, T. D.; Adamski, J.; Soranzo, N.; Gieger, C. Nature 2011, 477, 54-60. (10) Shah, S. H.; Kraus, W. E.; Newgard, C. B. Circulation 2012, 126, 1110-1120. (11) Beger, R. D.; Dunn, W.; Schmidt, M. A.; Gross, S. S.; Kirwan, J. A.; Cascante, M.; Brennan, L.; Wishart, D. S.; Oresic, M.; Hankemeier, T.; Broadhurst, D. I.; Lane, A. N.; Suhre, K.; Kastenmüller, G.; Sumner, S. J.; Thiele, I.; Fiehn, O.; Kaddurah-Daouk, R. Metabolomics 2016, 12. (12) Alonso, A.; Marsal, S.; Julià , A. Front Bioeng Biotechnol 2015, 3. (13) Zamboni, N.; Saghatelian, A.; Patti, Gary J. Mol Cell 2015, 58, 699-706. (14) Patti, G. J.; Yanes, O.; Siuzdak, G. Nat Rev Mol Cell Biol 2012, 13, 263-269. (15) Johnson, C. H.; Ivanisevic, J.; Benton, H. P.; Siuzdak, G. Anal Chem 2015, 87, 147-156. (16) Fernandez-Albert, F.; Llorach, R.; Garcia-Aloy, M.; Ziyatdinov, A.; Andres-Lacueva, C.; Perera, A. Bioinformatics 2014, 30, 2899-2905. (17) Wang, S. Y.; Kuo, C. H.; Tseng, Y. J. Anal Chem 2013, 85, 1037-1046. (18) Gika, H. G.; Theodoridis, G. A.; Wingate, J. E.; Wilson, I. D. J Proteome Res 2007, 6, 3291-3303. (19) Podwojski, K.; Fritsch, A.; Chamrad, D. C.; Paul, W.; Sitek, B.; Stuhler, K.; Mutzel, P.; Stephan, C.; Meyer, H. E.; Urfer, W.; Ickstadt, K.; Rahnenfuhrer, J. Bioinformatics 2009, 25, 758-764. (20) Fang, M.; Ivanisevic, J.; Benton, H. P.; Johnson, C. H.; Patti, G. J.; Hoang, L. T.; Uritboonthai, W.; Kurczy, M. E.; Siuzdak, G. Anal Chem 2015, 87, 10935-10941. (21) Gika, H. G.; Theodoridis, G. A.; Earll, M.; Snyder, R. W.; Sumner, S. J.; Wilson, I. D. Anal Chem 2010, 82, 8226-8234. (22) Kirchner, M.; Saussen, B.; Steen, H.; Steen, J. A. J.; Hamprecht, F. A. J Stat Soft 2007, 18. (23) Lange, E.; Tautenhahn, R.; Neumann, S.; Gröpl, C. BMC Bioinf 2008, 9, 375. (24) Nordström, A.; O'Maille, G.; Qin, C.; Siuzdak, G. Anal Chem 2006, 78, 3289-3295. (25) Smith, C. A.; Want, E. J.; O'Maille, G.; Abagyan, R.; Siuzdak, G. Anal Chem 2006, 78, 779-787. (26) Ejigu, B. A.; Valkenborg, D.; Baggerman, G.; Vanaerschot, M.; Witters, E.; Dujardin, J.-C.; Burzykowski, T.; Berg, M. OMICS 2013, 17, 473-485. (27) Dalby, A. R.; Karpievitch, Y. V.; Nikolic, S. B.; Wilson, R.; Sharman, J. E.; Edwards, L. M. PLoS ONE 2014, 9, e116221. (28) Veselkov, K. A.; Vingara, L. K.; Masson, P.; Robinette, S. L.; Want, E.; Li, J. V.; Barton, R. H.; Boursier-Neyret, C.; Walther, B.; Ebbels, T. M.; Pelczer, I. n.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Anal Chem 2011, 83, 5864-5872. (29) Sysi-Aho, M.; Katajamaa, M.; Yetukuri, L.; Orešič, M. BMC Bioinformatics 2007, 8, 93. 12 ACS Paragon Plus Environment

Page 13 of 17

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1 2 3 4 5 6 7 8 9 10 11 12 13 14

Analytical Chemistry

(30) Kuligowski, J.; Pérez-Guaita, D.; Lliso, I.; Escobar, J.; León, Z.; Gombau, L.; Solberg, R.; Saugstad, O. D.; Vento, M.; Quintás, G. Talanta 2014, 130, 442-448. (31) Kuligowski, J.; Sánchez-Illana, Á.; Sanjuán-Herráez, D.; Vento, M.; Quintás, G. The Analyst 2015, 140, 7810-7817. (32) Wehrens, R.; Hageman, J. A.; van Eeuwijk, F.; Kooke, R.; Flood, P. J.; Wijnker, E.; Keurentjes, J. J. B.; Lommen, A.; van Eekelen, H. D. L. M.; Hall, R. D.; Mumm, R.; de Vos, R. C. H. Metabolomics 2016, 12. (33) Fang, Z.-Z.; Gonzalez, F. J. Arch Toxicol 2014, 88, 1491-1502. (34) Pluskal, T.; Castillo, S.; Villar-Briones, A.; Oresic, M. BMC Bioinformatics 2010, 11, 395. (35) Sturm, M.; Bertsch, A.; Gröpl, C.; Hildebrandt, A.; Hussong, R.; Lange, E.; Pfeifer, N.; SchulzTrieglaff, O.; Zerck, A.; Reinert, K.; Kohlbacher, O. BMC Bioinformatics 2008, 9, 163. (36) Liang, Y.-J.; Lin, Y.-T.; Chen, C.-W.; Lin, C.-W.; Chao, K.-M.; Pan, W.-H.; Yang, H.-C. Anal Chem 2016, 88, 6334-6341. (37) Zhou, B.; Xiao, J. F.; Tuli, L.; Ressom, H. W. Mol BioSyst 2012, 8, 470-481.

15

16 17

For TOC Only

18

13 ACS Paragon Plus Environment

Feature 1

Sample 1 Sample 2 Sample 3

Quality Filtering

Statistical Analysis

Feature 2

Feature 1

Feature Alignment

Feature 2 Sample 2

Peak Picking

Sample 1

Mass Detection

Aligned

Feature 3

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33

B

m/z 150

Sample 3

A

Analytical Chemistry

ACS Paragon Plus Environment

Page 14 of 17

Misaligned

Cross-aligned

A

Batch 10

Retention Time Drift (sec)

Batch 1 +5

+5

0

0

-5

-5 0

1

2

3

4

5

6

7

0

1

Batch 20 +5

0

0

-5

-5 1

2

3

4

5

3

4

5

6

7

Batch 30

+5

0

2

6

7

1000

750

500 Model DF 16 DF 8 DF 4 DF 32 Original

250

0

1

2

3

4

5

6

C

AUC 405 398 392 390 142

0

7

Retention Time (min)

Percent Misalignment

Linear Alignment of Corrected Data

Signal Intensity

Linear Alignment of Original Data

Signal Intensity

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42

B

Analytical Chemistry

Cumulative Feature Count

Page 15 of 17

ACS Paragon Plus Environment

Retention Time (min)

Retention Time (min)

A

B

Signal

Signal

Page 16 of 17

Batch Effect

Stable Feature

Injection Order

Injection Order

Signal

Signal

Misalignment

Instrument Drift or Contaminant

Injection Order

Injection Order

Cross-alignment

Chemical Instability

Signal

Signal

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43

Analytical Chemistry

Injection Order

Stable

Misaligned

Injection Order

Cross-aligned / Batch Effect

ACS Paragon Plus Environment

Chemical Instability

~3000L CMSDat aSet s

Sampl eSpec icRTDr i f tModel i ng

Bat c h1 Page 17 of 17Analytical Chemistry

Bat c h10

Bat c h30

1 2 Stable Misaligned BatchEffect Unstable 3 4 5 ACS Paragon Plus Environment 6 7 AutomatedFeatureBinning RTCor r ec t i onWi t hi nRa wDat aF i l e