A Novel Feature Clustering Method Enables ... - ACS Publications

Jun 13, 2014 - and J. E. Prenni*. ,†,§. †. Proteomics and Metabolomics Facility, Colorado State University, Fort Collins, Colorado 80523, United ...
1 downloads 0 Views 907KB Size
Subscriber access provided by Swem Library | College of William and Mary & VIVA (Virtual Library of Virginia)

Technical Note

RAMClust: a novel feature clustering method enables spectral-matching based annotation for metabolomics data Corey David Broeckling, Fayyaz ul Amir Afsar Minhas, Steffen Neumann, Asa Ben-Hur, and Jessica E. Prenni Anal. Chem., Just Accepted Manuscript • Publication Date (Web): 13 Jun 2014 Downloaded from http://pubs.acs.org on June 16, 2014

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 19

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

1

RAMClust: a novel feature clustering method enables spectral-matching based annotation for

2

metabolomics data

3

Broeckling, C.D1; , Afsar, F. A2.; Neumann, S4.. Ben-Hur, A2, Prenni, J.E1,3.

4 5

Corey D. Broeckling: Proteomics and Metabolomics Facility, Colorado State University, Fort Collins, CO

6

80523. 970-491-2273. [email protected]

7

Fayyaz ul Amir Afsar Minhas: Department of Computer Science, Colorado State University, Fort Collins,

8

CO 80523. 970-213-9093: [email protected]

9

Steffen Neumann: Department of Stress- and Developmental Biology, Leibniz Institute of Plant

10

Biochemistry, 06108 Halle, Germany. [email protected]

11

Asa Ben-Hur: Department of Computer Science, Colorado State University, Fort Collins, CO 80523. 970-

12

491-4068. [email protected]

13

Jessica E. Prenni: Proteomics and Metabolomics Facility, Colorado State University, Fort Collins, CO

14

80523. 970-491-0961. [email protected]

15 16

1. Proteomics and Metabolomics Facility, Colorado State University, Fort Collins, CO 80523

17

2. Department of Computer Science, Colorado State University, Fort Collins, CO 80523

18

3. Department of Biochemistry, Colorado State University, Fort Collins, CO 80523

19

4. Department of Stress- and Developmental Biology, Leibniz Institute of Plant Biochemistry, 06108

20

Halle, Germany

ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

Abstract: Metabolomic data is frequently acquired using chromatographically coupled mass

2

spectrometry platforms. For such datasets, the first step in data analysis relies on feature detection,

3

where a feature is defined by a mass and retention time. While a feature typically is derived from a

4

single compound, a spectrum of mass signals is more a more accurate representation of the mass

5

spectrometric signal for a given metabolite. Here we report a novel feature grouping method which

6

operates in an unsupervised manner to group signals from MS data into spectra without relying on

7

predictability of the in-source phenomenon. We additionally address a fundamental bottleneck in

8

metabolomics, annotation of MS level signals, by incorporating indiscriminant MS/MS (idMS/MS) data

9

implicitly: feature detection is performed on both MS and indiscriminant MS/MS data, and feature-

10

feature relationships are determined simultaneously from the MS and idMS/MS data. This approach

11

facilitates identification of metabolites using in-source MS and/or idMS/MS spectra from a single

12

experiment, reduces quantitative analytical variation as compared to single feature measures, and

13

decreases false positive annotations of unpredictable phenomenon as novel compounds. This tool is

14

released as a freely available R package, called RAMClustR, and is sufficiently versatile to group features

15

from any chromatographic-spectrometric platform or feature finding software.

16

Introduction:

17

Mass spectrometry (MS) has long been utilized for detecting and quantifying small molecules,

18

particularly when coupled to separation tools such as gas chromatography (GC), liquid chromatography

19

(LC), or capillary electrophoresis (CE). The strengths of these chromatographically coupled mass

20

spectrometry platforms have been leveraged toward global metabolite profiling approaches, or

21

metabolomics. The development of electrospray ionization (ESI) 1 was an important technological

22

milestone which allowed for the coupling of liquid separation methods to mass spectrometers. This

23

development obviated the volatility requirement imposed by gas chromatography and supported

ACS Paragon Plus Environment

Page 2 of 19

Page 3 of 19

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

1

development and expansion of both metabolomics and proteomics. Electrospray is considered a ‘soft’

2

ionization technique, by which the molecular ion of the compound is generally more dominant than that

3

achieved using ‘hard’ ionization methods such as electron impact ionization (EI). However, the ESI

4

process is imperfectly ‘soft’ and does produce some degree of in-source fragmentation. Furthermore,

5

secondary adducts, multimers, and fragmentation products of these can form during the ionization

6

process resulting in multiple observed ions representative of a single compound. These redundant

7

signals are effectively utilized for EI spectra to allow for spectral matching based annotation metabolite

8

signals.

9

Data analysis workflows which seek to detect mass signals in a non-targeted manner utilize both mass

10

and retention time-based specificity – the resulting signal is commonly referred to as a ‘feature’. In the

11

absence of co-elution, one feature originates from a single compound. However, the reciprocal is

12

largely untrue: a single compound can give rise to multiple features, as described above. Therefore,

13

many metabolomics data processing tools, including both commercial and open source tools attempt to

14

group features into spectra . Some grouping strategies are based on chemically meaningful and

15

predictable patterns reflecting known phenomenon. However, this approach can be compromised by

16

1.) interfering signals from co-eluting metabolites in complex samples that happen to look like

17

fragments, adducts, or isotopes and 2.) unpredictable mass spectral fragments, adducts, or isotopes. As

18

such, an unsupervised approach to grouping features is an attractive alternative. Previous tools

19

including CAMERA2, AMDIS3, and MSClust4 have attempted to address this issue but none of these make

20

full use of the non-targeted data. For example, CAMERA is biased toward the most abundant features

21

and utilizes discrete binning by retention time. MSClust also looks for coeulting and covarying features

22

and ultimately selects a representative ‘centrotype’ feature for downstream statistical analysis – the

23

majority of features are discarded. AMDIS works on a single data file, is generally not used for

ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

quantitation, and does not utilize high mass accuracy data. Further, all of these tools are designed for

2

single channel MS datasets.

3

Here we report the development of a novel metabolomics workflow constructed around indiscriminant

4

MS/MS data acquisition (idMS/MS), which employs high collision energy fragmentation without

5

precursor ion selection5, acquired concurrently with low collision energy MS data. Our method is based

6

on the premise that two features resulting from the same compound exhibit similarity in their retention

7

times and a high correlation in their abundance profiles across different samples within a dataset.

8

Based on this observation, we have developed a simple similarity function between features that allows

9

us to use hierarchical clustering to generate the spectra of chemical compounds by grouping features

10

from a single compound in a single cluster. Feature finding is conducted in both low and high collision

11

energy data, and a custom feature similarity score drives clustering of features into spectra suitable for

12

informed manual interpretation as well as automated database searching. This approach results in both

13

in-source MS and idMS/MS spectra for all detected features and enables spectral matching to public,

14

commercial, and custom spectral databases without additional experimentation.

15

Experimental:

16

Sample acquisition and preparation:

17

Equine CSF samples were obtained as previously described6. CSF was thawed at 4°C, and 100 µL of CSF

18

was precipitated with 400 µL of cold methanol. This solution was mixed thoroughly, incubated at -20°C

19

for 1 hr, and spun at 12,000 xg for 15 minutes to remove proteins. The supernatant was transferred to

20

autosampler vials for UPLC-MS analysis. The validation dataset consists of 50 urine samples, collected

21

from Swedish males. Samples were prepared by thawing the urine at 4°C, diluting with equal part

22

water, and centrifuging to remove particulates.

ACS Paragon Plus Environment

Page 4 of 19

Page 5 of 19

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

1

UPLC-MS data acquisition:

2

Metabolome analysis of CSF and urine samples were accomplished using a Waters Acquity UPLC coupled

3

to a Xevo G2 Q-TOF MS. Five µL of either protein depleted CSF or diluted urine was injected onto an HSS

4

T3 column (Waters, 1x100mm, 1.7µM), and eluted using a gradient of water to acetonitrile, each

5

containing 0.1% formic acid. The gradient held at 0.1% B for 1 minute, ramped to 95%B over 12 minutes

6

and held for 3 minutes before returning to 0.1%B and equilibrating for 3.9 minutes (20 minute run time).

7

The flow rate was held constant at 200 µl/minute. Eluent was ionized via positive mode electrospray

8

ionization, with capillary voltage set to 2.2 kV, cone to 30 V, extraction cone to 2, with source

9

temperature at 150°C and desolvation nitrogen gas set to 350°C at 800 L/hr. Before acquisition, the

10

instrument was calibrated via infusion of sodium formate to within 1 ppm error. Mass accuracy was

11

ensured via infusion of leucine enkaphalin lockmass, collected as a 0.5 second scan at 10 V collision

12

energy every 20 seconds. Sample data were acquired in MS^E mode, with alternating scans (0.2

13

seconds/scan, m/z 50-1200) collected at collision energy of 6 V (MS) or using a CE ramp from 15-30V

14

(idMS/MS). Each sample was injected in duplicate, with each set of injections completely randomized

15

for acquisition order. In addition, the samples were analyzed using data dependent acquisition mode

16

for traditional MS/MS experiments, with one DDA MS/MS spectrum acquired per MS scan, with a

17

minimum precursor intensity threshold of 200 counts per second. All data was acquired in centroid

18

mode.

19

Raw data conversion and processing:

20

Waters raw files were converted to cdf format using Databridge, which separates low collision energy

21

MS and high collision energy idMS/MS data into two separate cdf files. The lockmass function data was

22

discarded for this application. Feature detection (utilizing the centWave algorithm), an initial grouping

23

step using a wide bandwidth (3), retention time correction, regrouping using a narrow bandwidth (1.5),

ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

and peak filling was performed using XCMS7 (v. 1.32.0) in R8 (v. 2.15). CAMERA2 (v. 1.16.0) was used a

2

benchmark comparison, utilizing default values.

3

RAMClust approach:

4

The RAMClust approach was developed in Matlab and is currently fully implemented in R in a package

5

called RAMClustR, and is currently available via github (https://github.com/cbroeckl/RAMClustR).

6

Implementation in R allowed an XCMS object to be used directly as input. The data within the XCMS

7

object were extracted using the XCMS groupval function and was normalized to the total XCMS

8

extracted ion signal (the quantile9 method is an available option in RAMclustR). When a second collision

9

energy level is used (as is possible with Waters MS^e5 datasets, utilized in this study), the user directs

10

delineation of MS and idMSMS datasets using a tag located within the filename or filepath of the xcms

11

object. ramclustR is also capable of accepting properly formatted data matrices from other peak

12

detection tools, with the only requirements being 1. No more than one sample (or file) name column

13

and one feature name row 2. Feature names which contain mass and retention time separated by a

14

constant delimiter and 3. features in columns and samples in rows. If both MS and idMS/MS data are to

15

be imported the features names must be identical between the two datasets.

16

RAMclust similarity was calculated for the full feature matrix (within a user specified maximum allowed

17

retention time window). Metabomolics datasets can generate thousands to tens of thousands of

18

features, which can tax the memory of many desktop computers. To manage memory, we utilize the ff

19

package10, which allows for rapid temporary storage of large R objects using physical disk space rather

20

than in memory, and process large data matrices in square blocks (2000 features at a time by default).

21

The RAMclust similarity scoring utilizes a guassian function, allowing flexibility in tuning correlational

22

and retention time similarity decay rates independently, based on the dataset and the acquisition

ACS Paragon Plus Environment

Page 6 of 19

Page 7 of 19

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

1

instrumentation. The correlational relationship between two features can be described by either MS-

2

MS, MS-idMS/MS, or idMS/MS-idMS/MS values, and we use pearson’s correlation to calculate similarity:  = max  −1 − /  2 , −1 − /  2 , −1 − /  







 2  −  −   2! 

"

3

where /  is the correlation coefficient between # and # (i and j representing the peak areas $ %

4

in each sample for any two features), and σt and σr represent sigma values for the retention time and

5

correlational r value, respectively. Similarities were then converted to dissimilarities & = 1 −  for

6

clustering. The output similarity matrix was then clustered using average (for this study) or complete

7

linkage hierarchical clustering via that fastcluster11 package. The dendrogram was then cut using the

8

cutreeDynamicTree function in the package, dynamicTreeCut12. For this application, minimum module

9

size is set to 2, dictating that only clusters with two or more features are returned, as singletons are

10

impossible to interpret intelligently.

11

Cluster membership, in conjunction with the abundance values from individual features in the input

12

data, were used to create spectra. Mass was derived from the feature mass, and the abundance for

13

each mass in the spectrum was derived from the weighted mean of the intensity values for that feature.

14

These spectra were then exported as an msp formatted document, which can be directly imported by

15

NIST MSsearch, or used as input for MassBank13 or NIST msPepSearch

16

(http://peptide.nist.gov/software/ms_pep_search_gui/MSPepSearch.html) batch searching. Finally, the

17

cluster membership was then used to create a third dataset, SpecData, which represented the MS level

18

data after condensing feature intensities into spectral intensities using a weighted mean function, where

19

the more abundant signals contribute more to the spectral intensity.

20

Results and Discussion:

21

We developed and tested our approach using a UPLC-MS dataset of 38 samples of equine cerebrospinal

22

fluid, and subsequently validated the approach in an independent urine dataset (supplementary Figure

ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

1). XCMS7 was used for feature finding, retention time correction, and alignment and the resulting

2

dataset was subsequently normalized to total XCMS signal intensity for each sample. The output data

3

was then divided into low collision energy (MS) and high collision energy (idMS/MS) datasets, each with

4

dimensions of row number equal to the number of injections columns equal to the number of features

5

(21060, for the CSF dataset). Each cell of these datasets represents signal intensity at either low (MS) or

6

high (idMS/MS) collision energy. We developed a custom similarity matrix which is the product of two

7

Gaussian terms, one that considers the differences in retention times between two features and second

8

that considers the correlation between two features across all samples in the dataset. These two terms

9

have widths defined by σt and σr, respectively. This captures our intuition that two features are similar

10

if they are close in retention time and are correlated: both are required for two features to be grouped.

11

Following the computation of the similarity matrix, features are clustered using hierarchical clustering.

12

To generate discrete clusters from the resulting hierarchical clustering dendrogram, we then used the

13

DynamicTreeCut14 package in R. Cluster membership of each feature provides qualitative spectral

14

membership information, and the quantitative data is taken from the MS and idMS/MS datasets;

15

abundance values are calculated as the averaged signal intensity for each feature separately in both the

16

low and high collision energy datasets. Thus, for each cluster two spectra are recreated corresponding

17

to the low collision energy in-source spectra and the high collision energy counterparts.

18

Any feature clustering tool must demonstrate accuracy to be useful in reducing redundancy without

19

reducing biological coverage. One option to accomplish this is to compare the results of the clustering

20

to a small panel of known compounds that are spiked into a sample. While this is a valid approach, it

21

relies on the assumption that the chosen panel of compounds is representative of all the metabolites in

22

a complex biological matrix. Thus, to increase the breadth of our validation experiments we instead

23

assessed the accuracy of the clustering by comparison against MS/MS spectra acquired using a

ACS Paragon Plus Environment

Page 8 of 19

Page 9 of 19

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

1

traditional dependent acquisition (DDA) approach from the same CSF samples. All precursor ions that 1.)

2

could be mapped to a feature in the output dataset and 2.) contained greater than 10 product ions

3

were used as ‘valid’ spectra for comparison. These spectra represented known precursor-product ion

4

relationships from many of the major signals in the dataset, even if the identity of the compounds was

5

unknown. The spectra created by RAMClust were then compared to the DDA spectra and the dot

6

product spectral similarity score calculated as a measure of accuracy, as described previously15. While

7

the complexity of in-source and indiscriminant MS/MS signals is expected to be higher than DDA MS/MS

8

spectra for the same compound, more accurate clustering will still be revealed as relatively higher dot-

9

product similarity scores between the RAMclustR reconstructed spectra and the mapped DDA MS/MS

10

spectrum.

11

The RAMClust algorithm has several parameters that can be tuned by the user to improve clustering

12

accuracy. σt and σr represent guassian tuning parameters of retention time similarity and correlational

13

score, respectively, between feature pairs. The influence of these two parameters on the similarity is

14

depicted in Figure 1. These tuning parameters will allow the algorithm to be used with MS data from

15

any chromatographic platform. When idMS/MS data are available, correlational similarity can be

16

calculated between two features either at the level of either MS vs. MS, MS vs. idMS/MS, or idMS/MS

17

vs. idMS/MS. While the MS-idMSMS correlation theoretically represents the CID event most directly,

18

this relationship is subject to potential interfering signals in both data channels (MS and idMS/MS). In

19

practice, a strong correlational relationship at any of the three levels represents strong evidence of

20

precursor product relationships, thus the algorithm utilized the maximum correlational r-value of the

21

three relationships.

22

The influence of σt and σr on the average spectral similarity between RAMClust and DDA spectra was

23

rigorously evaluated at 441 combinations of parameter levels of σt and σr (figure 2a). These results

ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

revealed a plateau of high spectral similarity at values of σt =2 and σr = 0.5 (Figure 2a). This σt value was

2

approximately half the median peak width of the XCMS detected peak (max-min time for each individual

3

peak in the xcms object), indicating that we can directly use XCMS input to set this parameter without

4

user intervention – this holds true for an independent dataset of urine samples (supplementary data).

5

Correlation is a scale free statistic, and should be platform neutral, thus we used our observed optimal

6

value of 0.5 and can expect reasonable performance on any platform. Implementation of RAMclustR

7

using parameters which maximized MS/MS similarity between reconstructed spectra and DDA spectra

8

generated approximately 2500 clusters with at least two features (Figure 2b), and relatively few

9

singletons (Figure 2c). This algorithm generated a large stable region, indicating that it is robust to small

10

changes in parameter values. This stability generated strong MS/MS similarity even at ‘unreasonable’ σt

11

values (>200 seconds), so long as σr is proportionally high (Figure 2a). We interpret this as a scaling

12

phenomenon, as the dynamicTreeCut algorithm is responsive to tree ‘shape’ rather than an absolute

13

height14. The dynamicTreeCut maximum height parameter was also examined in conjunction with σt,

14

and revealed that the tree pruning step benefited from some precutting (Figure 2d), thus we employ a

15

default value of 0.3 for this parameter. These parameterization rules make the algorithm extremely

16

easy to use: when an XCMS object is used as input, the user needs to set none of these parameters, and

17

when a dataset is imported from other software only σt needs to be manually set. The output MS/MS

18

similarity using default RAMclust similarity scores was used to compare results against the only other

19

feature grouping tool in R, CAMERA. The results of this comparison indicated that RAMclust grouping of

20

features resulted in spectra that are more similar to DDA spectra than the results generated from

21

CAMERA’s groupFWHM, groupCorr, and groupDen functions (Table 1). This observation was validated

22

on a second LC-MS dataset of urine samples: RAMClust grouping resulted in clustering output that

23

better represents valid feature relationships, and consequentially, biological small molecule signals.

ACS Paragon Plus Environment

Page 10 of 19

Page 11 of 19

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

1

The spectra produced via RAMclust grouping can written to NIST MSP format for viewing and searching,

2

and can be submitted directly to the MassBank Database13 batch search tool, submitted for batch

3

searching to NIST msPepSearch, and/or viewed and searched via the NIST MSSearch program. All these

4

tools offer the ability to generate and search against custom libraries of spectra, and our lab is creating

5

libraries of in-source spectra toward this end. However, idMS/MS spectra recreated from the RAMClust

6

algorithm and workflow were highly similar to authentic NIST MS/MS database spectra (Figure 3a-c),

7

demonstrating that this workflow can take full advantage of existing resources.

8

Since RAMClust generated spectra accurately reflect spectra of authentic chemical standards, the

9

intensity of the spectra themselves can be used as the quantitative unit for downstream statistical

10

analysis. The intensity of the spectra were calculated using a weighted mean function of all the

11

component features, such that each value in the resulting dataset represents the quantitative signal

12

intensity value for each spectrum for each sample in the dataset. The use of spectra dramatically

13

reduced analytical variation through an averaging of measurement noise, as compared to either the

14

mean or median feature-based variation for each cluster (Figure 3d).

15

Conclusions:

16

Annotation of mass signals in non-targeted metabolomics experiments remains a significant bottleneck

17

and is arguably one of the most important challenges to the field as confident metabolite identification

18

is required for biological interpretation. In this report, we demonstrate a novel workflow utilizing

19

indiscriminant MS/MS data acquisition, expanded feature finding and a novel clustering algorithm to

20

group features based on both low and high collision energy data to generate spectra that are compatible

21

with publically available spectral search tools. The workflow allows for more efficient use of

22

instrumentation, reduced feature redundancy and false discovery rate correction burden for

23

downstream univariate statistical tests, improved analytical reproducibility, a more automated

ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

annotation workflow, and greatly increased confidence in the annotations as compared to accurate

2

mass-based searching alone. RAMClustR is available for download at

3

https://github.com/cbroeckl/RAMClustR.

4 5

Acknowledgements: Afsar, F. A was funded by the Fulbright Scholarship Program of the U.S.

6

Department of State and the Higher Education Commission of the Government of Pakistan.

7

References

8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37

1. C. M. Whitehouse, R. N. Dreyer, M. Yamashita, J. B. Fenn, Electrospray Interface for Liquid Chromatographs and Mass Spectrometers. Analytical chemistry 1985, 57. 675-679, DOI: Doi 10.1021/Ac00280a023. 2. C. Kuhl, R. Tautenhahn, C. Böttcher, T. R. Larson, S. Neumann, CAMERA: An Integrated Strategy for Compound Spectra Extraction and Annotation of Liquid Chromatography/Mass Spectrometry Data Sets. Analytical chemistry 2012, 84. 283-289, DOI: Doi 10.1021/Ac202450g. 3. J. M. Halket, A. Przyborowska, S. E. Stein, W. G. Mallard, S. Down, R. A. Chalmers, Deconvolution gas chromatography mass spectrometry of urinary organic acids - Potential for pattern recognition and automated identification of metabolic disorders. Rapid Commun Mass Sp 1999, 13. 279-284, DOI: Doi 10.1002/(Sici)1097-0231(19990228)13:43.0.Co;2-I. 4. Y. M. Tikunov, S. Laptenok, R. D. Hall, A. Bovy, R. C. de Vos, MSClust: a tool for unsupervised mass spectra extraction of chromatography-mass spectrometry ion-wise aligned data. Metabolomics : Official journal of the Metabolomic Society 2012, 8. 714-718, DOI: 10.1007/s11306-011-0368-2. 5. R. S. Plumb, K. A. Johnson, P. Rainville, B. W. Smith, I. D. Wilson, J. M. Castro-Perez, J. K. Nicholson, UPLC/MSE; a new approach for generating molecular fragment information for biomarker structure elucidation (vol 20, pg 1989, 2006). Rapid Commun Mass Sp 2006, 20. 2234-2234, DOI: Doi 10.1002/Rcm.2602. 6. C. J. Broccardo, G. S. Hussey, L. Goehring, P. Lunn, J. E. Prenni, Proteomic Characterization of Equine Cerebrospinal Fluid. Journal of Equine Veterinary Science 2013. DOI: http://dx.doi.org/10.1016/j.jevs.2013.07.013. 7. C. A. Smith, E. J. Want, G. O'Maille, R. Abagyan, G. Siuzdak, XCMS: Processing mass spectrometry data for metabolite profiling using Nonlinear peak alignment, matching, and identification. Analytical chemistry 2006, 78. 779-787, DOI: Doi 10.1021/Ac051437y. 8. R. C. Team. Vienna, Austria, 2013. 9. (a) B. M. Bolstad, preprocessCore: A collection of pre-processing functions}. 2014, R package version 1.22.0; (b) L. Brodsky, A. Moussaieff, N. Shahaf, A. Aharoni, I. Rogachev, Evaluation of Peak Picking Quality in LC-MS Metabolomics Data. Analytical chemistry 2010, 82. 9177-9187, DOI: Doi 10.1021/Ac101216e. 10. D. Adler, C. Gläser, O. Nenadic, J. Oehlschlägel, W. Zucchini. R package version 2.2-11 edn., 2013.

ACS Paragon Plus Environment

Page 12 of 19

Page 13 of 19

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Analytical Chemistry

11. D. Mullner, fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python. Journal of Statistical Software 2013, 53. 1-18. 12. P. Langfelder, B. Zhang, S. Horvath, Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 2008, 24. 719-20, DOI: 10.1093/bioinformatics/btm563. 13. H. Horai, M. Arita, S. Kanaya, Y. Nihei, T. Ikeda, K. Suwa, Y. Ojima, K. Tanaka, S. Tanaka, K. Aoshima, Y. Oda, Y. Kakazu, M. Kusano, T. Tohge, F. Matsuda, Y. Sawada, M. Y. Hirai, H. Nakanishi, K. Ikeda, N. Akimoto, T. Maoka, H. Takahashi, T. Ara, N. Sakurai, H. Suzuki, D. Shibata, S. Neumann, T. Iida, K. Tanaka, K. Funatsu, F. Matsuura, T. Soga, R. Taguchi, K. Saito, T. Nishioka, MassBank: a public repository for sharing mass spectral data for life sciences. Journal of Mass Spectrometry 2010, 45. 703714, DOI: Doi 10.1002/Jms.1777. 14. P. Langfelder, B. Zhang, S. Horvath, Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 2008, 24. 719-720, DOI: DOI 10.1093/bioinformatics/btm563. 15. C. D. Broeckling, A. L. Heuberger, J. A. Prince, E. Ingelsson, J. E. Prenni, Assigning precursorproduct ion relationships in indiscriminant MS/MS data from non-targeted metabolite profiling studies. Metabolomics : Official journal of the Metabolomic Society 2013, 9. 33-43, DOI: DOI 10.1007/s11306012-0426-4.

18 19

Table 1: Comparison between ramclustR and CAMERA. MSMSsimilarity refers to the spectral similarity

20

between mapped feature for which data dependent MS/MS data were available and the reconstructed

21

spectra from the output dataset defined in the ‘method’ column. nClus (>1) refers to the number of

22

clusters with two or more features defined by the grouping method, and perSing is the percentage of all

23

features in the dataset that remain ungrouped (singletons). The comparisons were performed using

24

default values for both the CSF and Urine datasets. The first three rows in each CSF and URINE datasets

25

reflect CAMERA functions, while the final row reflects RAMClustR-based grouping.

Dataset CSF

Method xsb