Subscriber access provided by Swem Library | College of William and Mary & VIVA (Virtual Library of Virginia)
Technical Note
RAMClust: a novel feature clustering method enables spectral-matching based annotation for metabolomics data Corey David Broeckling, Fayyaz ul Amir Afsar Minhas, Steffen Neumann, Asa Ben-Hur, and Jessica E. Prenni Anal. Chem., Just Accepted Manuscript • Publication Date (Web): 13 Jun 2014 Downloaded from http://pubs.acs.org on June 16, 2014
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 19
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
RAMClust: a novel feature clustering method enables spectral-matching based annotation for
2
metabolomics data
3
Broeckling, C.D1; , Afsar, F. A2.; Neumann, S4.. Ben-Hur, A2, Prenni, J.E1,3.
4 5
Corey D. Broeckling: Proteomics and Metabolomics Facility, Colorado State University, Fort Collins, CO
6
80523. 970-491-2273.
[email protected] 7
Fayyaz ul Amir Afsar Minhas: Department of Computer Science, Colorado State University, Fort Collins,
8
CO 80523. 970-213-9093:
[email protected] 9
Steffen Neumann: Department of Stress- and Developmental Biology, Leibniz Institute of Plant
10
Biochemistry, 06108 Halle, Germany.
[email protected] 11
Asa Ben-Hur: Department of Computer Science, Colorado State University, Fort Collins, CO 80523. 970-
12
491-4068.
[email protected] 13
Jessica E. Prenni: Proteomics and Metabolomics Facility, Colorado State University, Fort Collins, CO
14
80523. 970-491-0961.
[email protected] 15 16
1. Proteomics and Metabolomics Facility, Colorado State University, Fort Collins, CO 80523
17
2. Department of Computer Science, Colorado State University, Fort Collins, CO 80523
18
3. Department of Biochemistry, Colorado State University, Fort Collins, CO 80523
19
4. Department of Stress- and Developmental Biology, Leibniz Institute of Plant Biochemistry, 06108
20
Halle, Germany
ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1
Abstract: Metabolomic data is frequently acquired using chromatographically coupled mass
2
spectrometry platforms. For such datasets, the first step in data analysis relies on feature detection,
3
where a feature is defined by a mass and retention time. While a feature typically is derived from a
4
single compound, a spectrum of mass signals is more a more accurate representation of the mass
5
spectrometric signal for a given metabolite. Here we report a novel feature grouping method which
6
operates in an unsupervised manner to group signals from MS data into spectra without relying on
7
predictability of the in-source phenomenon. We additionally address a fundamental bottleneck in
8
metabolomics, annotation of MS level signals, by incorporating indiscriminant MS/MS (idMS/MS) data
9
implicitly: feature detection is performed on both MS and indiscriminant MS/MS data, and feature-
10
feature relationships are determined simultaneously from the MS and idMS/MS data. This approach
11
facilitates identification of metabolites using in-source MS and/or idMS/MS spectra from a single
12
experiment, reduces quantitative analytical variation as compared to single feature measures, and
13
decreases false positive annotations of unpredictable phenomenon as novel compounds. This tool is
14
released as a freely available R package, called RAMClustR, and is sufficiently versatile to group features
15
from any chromatographic-spectrometric platform or feature finding software.
16
Introduction:
17
Mass spectrometry (MS) has long been utilized for detecting and quantifying small molecules,
18
particularly when coupled to separation tools such as gas chromatography (GC), liquid chromatography
19
(LC), or capillary electrophoresis (CE). The strengths of these chromatographically coupled mass
20
spectrometry platforms have been leveraged toward global metabolite profiling approaches, or
21
metabolomics. The development of electrospray ionization (ESI) 1 was an important technological
22
milestone which allowed for the coupling of liquid separation methods to mass spectrometers. This
23
development obviated the volatility requirement imposed by gas chromatography and supported
ACS Paragon Plus Environment
Page 2 of 19
Page 3 of 19
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
development and expansion of both metabolomics and proteomics. Electrospray is considered a ‘soft’
2
ionization technique, by which the molecular ion of the compound is generally more dominant than that
3
achieved using ‘hard’ ionization methods such as electron impact ionization (EI). However, the ESI
4
process is imperfectly ‘soft’ and does produce some degree of in-source fragmentation. Furthermore,
5
secondary adducts, multimers, and fragmentation products of these can form during the ionization
6
process resulting in multiple observed ions representative of a single compound. These redundant
7
signals are effectively utilized for EI spectra to allow for spectral matching based annotation metabolite
8
signals.
9
Data analysis workflows which seek to detect mass signals in a non-targeted manner utilize both mass
10
and retention time-based specificity – the resulting signal is commonly referred to as a ‘feature’. In the
11
absence of co-elution, one feature originates from a single compound. However, the reciprocal is
12
largely untrue: a single compound can give rise to multiple features, as described above. Therefore,
13
many metabolomics data processing tools, including both commercial and open source tools attempt to
14
group features into spectra . Some grouping strategies are based on chemically meaningful and
15
predictable patterns reflecting known phenomenon. However, this approach can be compromised by
16
1.) interfering signals from co-eluting metabolites in complex samples that happen to look like
17
fragments, adducts, or isotopes and 2.) unpredictable mass spectral fragments, adducts, or isotopes. As
18
such, an unsupervised approach to grouping features is an attractive alternative. Previous tools
19
including CAMERA2, AMDIS3, and MSClust4 have attempted to address this issue but none of these make
20
full use of the non-targeted data. For example, CAMERA is biased toward the most abundant features
21
and utilizes discrete binning by retention time. MSClust also looks for coeulting and covarying features
22
and ultimately selects a representative ‘centrotype’ feature for downstream statistical analysis – the
23
majority of features are discarded. AMDIS works on a single data file, is generally not used for
ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1
quantitation, and does not utilize high mass accuracy data. Further, all of these tools are designed for
2
single channel MS datasets.
3
Here we report the development of a novel metabolomics workflow constructed around indiscriminant
4
MS/MS data acquisition (idMS/MS), which employs high collision energy fragmentation without
5
precursor ion selection5, acquired concurrently with low collision energy MS data. Our method is based
6
on the premise that two features resulting from the same compound exhibit similarity in their retention
7
times and a high correlation in their abundance profiles across different samples within a dataset.
8
Based on this observation, we have developed a simple similarity function between features that allows
9
us to use hierarchical clustering to generate the spectra of chemical compounds by grouping features
10
from a single compound in a single cluster. Feature finding is conducted in both low and high collision
11
energy data, and a custom feature similarity score drives clustering of features into spectra suitable for
12
informed manual interpretation as well as automated database searching. This approach results in both
13
in-source MS and idMS/MS spectra for all detected features and enables spectral matching to public,
14
commercial, and custom spectral databases without additional experimentation.
15
Experimental:
16
Sample acquisition and preparation:
17
Equine CSF samples were obtained as previously described6. CSF was thawed at 4°C, and 100 µL of CSF
18
was precipitated with 400 µL of cold methanol. This solution was mixed thoroughly, incubated at -20°C
19
for 1 hr, and spun at 12,000 xg for 15 minutes to remove proteins. The supernatant was transferred to
20
autosampler vials for UPLC-MS analysis. The validation dataset consists of 50 urine samples, collected
21
from Swedish males. Samples were prepared by thawing the urine at 4°C, diluting with equal part
22
water, and centrifuging to remove particulates.
ACS Paragon Plus Environment
Page 4 of 19
Page 5 of 19
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
UPLC-MS data acquisition:
2
Metabolome analysis of CSF and urine samples were accomplished using a Waters Acquity UPLC coupled
3
to a Xevo G2 Q-TOF MS. Five µL of either protein depleted CSF or diluted urine was injected onto an HSS
4
T3 column (Waters, 1x100mm, 1.7µM), and eluted using a gradient of water to acetonitrile, each
5
containing 0.1% formic acid. The gradient held at 0.1% B for 1 minute, ramped to 95%B over 12 minutes
6
and held for 3 minutes before returning to 0.1%B and equilibrating for 3.9 minutes (20 minute run time).
7
The flow rate was held constant at 200 µl/minute. Eluent was ionized via positive mode electrospray
8
ionization, with capillary voltage set to 2.2 kV, cone to 30 V, extraction cone to 2, with source
9
temperature at 150°C and desolvation nitrogen gas set to 350°C at 800 L/hr. Before acquisition, the
10
instrument was calibrated via infusion of sodium formate to within 1 ppm error. Mass accuracy was
11
ensured via infusion of leucine enkaphalin lockmass, collected as a 0.5 second scan at 10 V collision
12
energy every 20 seconds. Sample data were acquired in MS^E mode, with alternating scans (0.2
13
seconds/scan, m/z 50-1200) collected at collision energy of 6 V (MS) or using a CE ramp from 15-30V
14
(idMS/MS). Each sample was injected in duplicate, with each set of injections completely randomized
15
for acquisition order. In addition, the samples were analyzed using data dependent acquisition mode
16
for traditional MS/MS experiments, with one DDA MS/MS spectrum acquired per MS scan, with a
17
minimum precursor intensity threshold of 200 counts per second. All data was acquired in centroid
18
mode.
19
Raw data conversion and processing:
20
Waters raw files were converted to cdf format using Databridge, which separates low collision energy
21
MS and high collision energy idMS/MS data into two separate cdf files. The lockmass function data was
22
discarded for this application. Feature detection (utilizing the centWave algorithm), an initial grouping
23
step using a wide bandwidth (3), retention time correction, regrouping using a narrow bandwidth (1.5),
ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1
and peak filling was performed using XCMS7 (v. 1.32.0) in R8 (v. 2.15). CAMERA2 (v. 1.16.0) was used a
2
benchmark comparison, utilizing default values.
3
RAMClust approach:
4
The RAMClust approach was developed in Matlab and is currently fully implemented in R in a package
5
called RAMClustR, and is currently available via github (https://github.com/cbroeckl/RAMClustR).
6
Implementation in R allowed an XCMS object to be used directly as input. The data within the XCMS
7
object were extracted using the XCMS groupval function and was normalized to the total XCMS
8
extracted ion signal (the quantile9 method is an available option in RAMclustR). When a second collision
9
energy level is used (as is possible with Waters MS^e5 datasets, utilized in this study), the user directs
10
delineation of MS and idMSMS datasets using a tag located within the filename or filepath of the xcms
11
object. ramclustR is also capable of accepting properly formatted data matrices from other peak
12
detection tools, with the only requirements being 1. No more than one sample (or file) name column
13
and one feature name row 2. Feature names which contain mass and retention time separated by a
14
constant delimiter and 3. features in columns and samples in rows. If both MS and idMS/MS data are to
15
be imported the features names must be identical between the two datasets.
16
RAMclust similarity was calculated for the full feature matrix (within a user specified maximum allowed
17
retention time window). Metabomolics datasets can generate thousands to tens of thousands of
18
features, which can tax the memory of many desktop computers. To manage memory, we utilize the ff
19
package10, which allows for rapid temporary storage of large R objects using physical disk space rather
20
than in memory, and process large data matrices in square blocks (2000 features at a time by default).
21
The RAMclust similarity scoring utilizes a guassian function, allowing flexibility in tuning correlational
22
and retention time similarity decay rates independently, based on the dataset and the acquisition
ACS Paragon Plus Environment
Page 6 of 19
Page 7 of 19
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
instrumentation. The correlational relationship between two features can be described by either MS-
2
MS, MS-idMS/MS, or idMS/MS-idMS/MS values, and we use pearson’s correlation to calculate similarity: = max −1 − / 2 , −1 − / 2 , −1 − /
2 − − 2!
"
3
where / is the correlation coefficient between # and # (i and j representing the peak areas $ %
4
in each sample for any two features), and σt and σr represent sigma values for the retention time and
5
correlational r value, respectively. Similarities were then converted to dissimilarities & = 1 − for
6
clustering. The output similarity matrix was then clustered using average (for this study) or complete
7
linkage hierarchical clustering via that fastcluster11 package. The dendrogram was then cut using the
8
cutreeDynamicTree function in the package, dynamicTreeCut12. For this application, minimum module
9
size is set to 2, dictating that only clusters with two or more features are returned, as singletons are
10
impossible to interpret intelligently.
11
Cluster membership, in conjunction with the abundance values from individual features in the input
12
data, were used to create spectra. Mass was derived from the feature mass, and the abundance for
13
each mass in the spectrum was derived from the weighted mean of the intensity values for that feature.
14
These spectra were then exported as an msp formatted document, which can be directly imported by
15
NIST MSsearch, or used as input for MassBank13 or NIST msPepSearch
16
(http://peptide.nist.gov/software/ms_pep_search_gui/MSPepSearch.html) batch searching. Finally, the
17
cluster membership was then used to create a third dataset, SpecData, which represented the MS level
18
data after condensing feature intensities into spectral intensities using a weighted mean function, where
19
the more abundant signals contribute more to the spectral intensity.
20
Results and Discussion:
21
We developed and tested our approach using a UPLC-MS dataset of 38 samples of equine cerebrospinal
22
fluid, and subsequently validated the approach in an independent urine dataset (supplementary Figure
ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1
1). XCMS7 was used for feature finding, retention time correction, and alignment and the resulting
2
dataset was subsequently normalized to total XCMS signal intensity for each sample. The output data
3
was then divided into low collision energy (MS) and high collision energy (idMS/MS) datasets, each with
4
dimensions of row number equal to the number of injections columns equal to the number of features
5
(21060, for the CSF dataset). Each cell of these datasets represents signal intensity at either low (MS) or
6
high (idMS/MS) collision energy. We developed a custom similarity matrix which is the product of two
7
Gaussian terms, one that considers the differences in retention times between two features and second
8
that considers the correlation between two features across all samples in the dataset. These two terms
9
have widths defined by σt and σr, respectively. This captures our intuition that two features are similar
10
if they are close in retention time and are correlated: both are required for two features to be grouped.
11
Following the computation of the similarity matrix, features are clustered using hierarchical clustering.
12
To generate discrete clusters from the resulting hierarchical clustering dendrogram, we then used the
13
DynamicTreeCut14 package in R. Cluster membership of each feature provides qualitative spectral
14
membership information, and the quantitative data is taken from the MS and idMS/MS datasets;
15
abundance values are calculated as the averaged signal intensity for each feature separately in both the
16
low and high collision energy datasets. Thus, for each cluster two spectra are recreated corresponding
17
to the low collision energy in-source spectra and the high collision energy counterparts.
18
Any feature clustering tool must demonstrate accuracy to be useful in reducing redundancy without
19
reducing biological coverage. One option to accomplish this is to compare the results of the clustering
20
to a small panel of known compounds that are spiked into a sample. While this is a valid approach, it
21
relies on the assumption that the chosen panel of compounds is representative of all the metabolites in
22
a complex biological matrix. Thus, to increase the breadth of our validation experiments we instead
23
assessed the accuracy of the clustering by comparison against MS/MS spectra acquired using a
ACS Paragon Plus Environment
Page 8 of 19
Page 9 of 19
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
traditional dependent acquisition (DDA) approach from the same CSF samples. All precursor ions that 1.)
2
could be mapped to a feature in the output dataset and 2.) contained greater than 10 product ions
3
were used as ‘valid’ spectra for comparison. These spectra represented known precursor-product ion
4
relationships from many of the major signals in the dataset, even if the identity of the compounds was
5
unknown. The spectra created by RAMClust were then compared to the DDA spectra and the dot
6
product spectral similarity score calculated as a measure of accuracy, as described previously15. While
7
the complexity of in-source and indiscriminant MS/MS signals is expected to be higher than DDA MS/MS
8
spectra for the same compound, more accurate clustering will still be revealed as relatively higher dot-
9
product similarity scores between the RAMclustR reconstructed spectra and the mapped DDA MS/MS
10
spectrum.
11
The RAMClust algorithm has several parameters that can be tuned by the user to improve clustering
12
accuracy. σt and σr represent guassian tuning parameters of retention time similarity and correlational
13
score, respectively, between feature pairs. The influence of these two parameters on the similarity is
14
depicted in Figure 1. These tuning parameters will allow the algorithm to be used with MS data from
15
any chromatographic platform. When idMS/MS data are available, correlational similarity can be
16
calculated between two features either at the level of either MS vs. MS, MS vs. idMS/MS, or idMS/MS
17
vs. idMS/MS. While the MS-idMSMS correlation theoretically represents the CID event most directly,
18
this relationship is subject to potential interfering signals in both data channels (MS and idMS/MS). In
19
practice, a strong correlational relationship at any of the three levels represents strong evidence of
20
precursor product relationships, thus the algorithm utilized the maximum correlational r-value of the
21
three relationships.
22
The influence of σt and σr on the average spectral similarity between RAMClust and DDA spectra was
23
rigorously evaluated at 441 combinations of parameter levels of σt and σr (figure 2a). These results
ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1
revealed a plateau of high spectral similarity at values of σt =2 and σr = 0.5 (Figure 2a). This σt value was
2
approximately half the median peak width of the XCMS detected peak (max-min time for each individual
3
peak in the xcms object), indicating that we can directly use XCMS input to set this parameter without
4
user intervention – this holds true for an independent dataset of urine samples (supplementary data).
5
Correlation is a scale free statistic, and should be platform neutral, thus we used our observed optimal
6
value of 0.5 and can expect reasonable performance on any platform. Implementation of RAMclustR
7
using parameters which maximized MS/MS similarity between reconstructed spectra and DDA spectra
8
generated approximately 2500 clusters with at least two features (Figure 2b), and relatively few
9
singletons (Figure 2c). This algorithm generated a large stable region, indicating that it is robust to small
10
changes in parameter values. This stability generated strong MS/MS similarity even at ‘unreasonable’ σt
11
values (>200 seconds), so long as σr is proportionally high (Figure 2a). We interpret this as a scaling
12
phenomenon, as the dynamicTreeCut algorithm is responsive to tree ‘shape’ rather than an absolute
13
height14. The dynamicTreeCut maximum height parameter was also examined in conjunction with σt,
14
and revealed that the tree pruning step benefited from some precutting (Figure 2d), thus we employ a
15
default value of 0.3 for this parameter. These parameterization rules make the algorithm extremely
16
easy to use: when an XCMS object is used as input, the user needs to set none of these parameters, and
17
when a dataset is imported from other software only σt needs to be manually set. The output MS/MS
18
similarity using default RAMclust similarity scores was used to compare results against the only other
19
feature grouping tool in R, CAMERA. The results of this comparison indicated that RAMclust grouping of
20
features resulted in spectra that are more similar to DDA spectra than the results generated from
21
CAMERA’s groupFWHM, groupCorr, and groupDen functions (Table 1). This observation was validated
22
on a second LC-MS dataset of urine samples: RAMClust grouping resulted in clustering output that
23
better represents valid feature relationships, and consequentially, biological small molecule signals.
ACS Paragon Plus Environment
Page 10 of 19
Page 11 of 19
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
The spectra produced via RAMclust grouping can written to NIST MSP format for viewing and searching,
2
and can be submitted directly to the MassBank Database13 batch search tool, submitted for batch
3
searching to NIST msPepSearch, and/or viewed and searched via the NIST MSSearch program. All these
4
tools offer the ability to generate and search against custom libraries of spectra, and our lab is creating
5
libraries of in-source spectra toward this end. However, idMS/MS spectra recreated from the RAMClust
6
algorithm and workflow were highly similar to authentic NIST MS/MS database spectra (Figure 3a-c),
7
demonstrating that this workflow can take full advantage of existing resources.
8
Since RAMClust generated spectra accurately reflect spectra of authentic chemical standards, the
9
intensity of the spectra themselves can be used as the quantitative unit for downstream statistical
10
analysis. The intensity of the spectra were calculated using a weighted mean function of all the
11
component features, such that each value in the resulting dataset represents the quantitative signal
12
intensity value for each spectrum for each sample in the dataset. The use of spectra dramatically
13
reduced analytical variation through an averaging of measurement noise, as compared to either the
14
mean or median feature-based variation for each cluster (Figure 3d).
15
Conclusions:
16
Annotation of mass signals in non-targeted metabolomics experiments remains a significant bottleneck
17
and is arguably one of the most important challenges to the field as confident metabolite identification
18
is required for biological interpretation. In this report, we demonstrate a novel workflow utilizing
19
indiscriminant MS/MS data acquisition, expanded feature finding and a novel clustering algorithm to
20
group features based on both low and high collision energy data to generate spectra that are compatible
21
with publically available spectral search tools. The workflow allows for more efficient use of
22
instrumentation, reduced feature redundancy and false discovery rate correction burden for
23
downstream univariate statistical tests, improved analytical reproducibility, a more automated
ACS Paragon Plus Environment
Analytical Chemistry
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1
annotation workflow, and greatly increased confidence in the annotations as compared to accurate
2
mass-based searching alone. RAMClustR is available for download at
3
https://github.com/cbroeckl/RAMClustR.
4 5
Acknowledgements: Afsar, F. A was funded by the Fulbright Scholarship Program of the U.S.
6
Department of State and the Higher Education Commission of the Government of Pakistan.
7
References
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
1. C. M. Whitehouse, R. N. Dreyer, M. Yamashita, J. B. Fenn, Electrospray Interface for Liquid Chromatographs and Mass Spectrometers. Analytical chemistry 1985, 57. 675-679, DOI: Doi 10.1021/Ac00280a023. 2. C. Kuhl, R. Tautenhahn, C. Böttcher, T. R. Larson, S. Neumann, CAMERA: An Integrated Strategy for Compound Spectra Extraction and Annotation of Liquid Chromatography/Mass Spectrometry Data Sets. Analytical chemistry 2012, 84. 283-289, DOI: Doi 10.1021/Ac202450g. 3. J. M. Halket, A. Przyborowska, S. E. Stein, W. G. Mallard, S. Down, R. A. Chalmers, Deconvolution gas chromatography mass spectrometry of urinary organic acids - Potential for pattern recognition and automated identification of metabolic disorders. Rapid Commun Mass Sp 1999, 13. 279-284, DOI: Doi 10.1002/(Sici)1097-0231(19990228)13:43.0.Co;2-I. 4. Y. M. Tikunov, S. Laptenok, R. D. Hall, A. Bovy, R. C. de Vos, MSClust: a tool for unsupervised mass spectra extraction of chromatography-mass spectrometry ion-wise aligned data. Metabolomics : Official journal of the Metabolomic Society 2012, 8. 714-718, DOI: 10.1007/s11306-011-0368-2. 5. R. S. Plumb, K. A. Johnson, P. Rainville, B. W. Smith, I. D. Wilson, J. M. Castro-Perez, J. K. Nicholson, UPLC/MSE; a new approach for generating molecular fragment information for biomarker structure elucidation (vol 20, pg 1989, 2006). Rapid Commun Mass Sp 2006, 20. 2234-2234, DOI: Doi 10.1002/Rcm.2602. 6. C. J. Broccardo, G. S. Hussey, L. Goehring, P. Lunn, J. E. Prenni, Proteomic Characterization of Equine Cerebrospinal Fluid. Journal of Equine Veterinary Science 2013. DOI: http://dx.doi.org/10.1016/j.jevs.2013.07.013. 7. C. A. Smith, E. J. Want, G. O'Maille, R. Abagyan, G. Siuzdak, XCMS: Processing mass spectrometry data for metabolite profiling using Nonlinear peak alignment, matching, and identification. Analytical chemistry 2006, 78. 779-787, DOI: Doi 10.1021/Ac051437y. 8. R. C. Team. Vienna, Austria, 2013. 9. (a) B. M. Bolstad, preprocessCore: A collection of pre-processing functions}. 2014, R package version 1.22.0; (b) L. Brodsky, A. Moussaieff, N. Shahaf, A. Aharoni, I. Rogachev, Evaluation of Peak Picking Quality in LC-MS Metabolomics Data. Analytical chemistry 2010, 82. 9177-9187, DOI: Doi 10.1021/Ac101216e. 10. D. Adler, C. Gläser, O. Nenadic, J. Oehlschlägel, W. Zucchini. R package version 2.2-11 edn., 2013.
ACS Paragon Plus Environment
Page 12 of 19
Page 13 of 19
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Analytical Chemistry
11. D. Mullner, fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python. Journal of Statistical Software 2013, 53. 1-18. 12. P. Langfelder, B. Zhang, S. Horvath, Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 2008, 24. 719-20, DOI: 10.1093/bioinformatics/btm563. 13. H. Horai, M. Arita, S. Kanaya, Y. Nihei, T. Ikeda, K. Suwa, Y. Ojima, K. Tanaka, S. Tanaka, K. Aoshima, Y. Oda, Y. Kakazu, M. Kusano, T. Tohge, F. Matsuda, Y. Sawada, M. Y. Hirai, H. Nakanishi, K. Ikeda, N. Akimoto, T. Maoka, H. Takahashi, T. Ara, N. Sakurai, H. Suzuki, D. Shibata, S. Neumann, T. Iida, K. Tanaka, K. Funatsu, F. Matsuura, T. Soga, R. Taguchi, K. Saito, T. Nishioka, MassBank: a public repository for sharing mass spectral data for life sciences. Journal of Mass Spectrometry 2010, 45. 703714, DOI: Doi 10.1002/Jms.1777. 14. P. Langfelder, B. Zhang, S. Horvath, Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 2008, 24. 719-720, DOI: DOI 10.1093/bioinformatics/btm563. 15. C. D. Broeckling, A. L. Heuberger, J. A. Prince, E. Ingelsson, J. E. Prenni, Assigning precursorproduct ion relationships in indiscriminant MS/MS data from non-targeted metabolite profiling studies. Metabolomics : Official journal of the Metabolomic Society 2013, 9. 33-43, DOI: DOI 10.1007/s11306012-0426-4.
18 19
Table 1: Comparison between ramclustR and CAMERA. MSMSsimilarity refers to the spectral similarity
20
between mapped feature for which data dependent MS/MS data were available and the reconstructed
21
spectra from the output dataset defined in the ‘method’ column. nClus (>1) refers to the number of
22
clusters with two or more features defined by the grouping method, and perSing is the percentage of all
23
features in the dataset that remain ungrouped (singletons). The comparisons were performed using
24
default values for both the CSF and Urine datasets. The first three rows in each CSF and URINE datasets
25
reflect CAMERA functions, while the final row reflects RAMClustR-based grouping.
Dataset CSF
Method xsb