Article
Non-Targeted Screening of Food Matrices: Development of a Chemometric Software Strategy to Identify Unknowns in Liquid Chromatography-Mass Spectrometry Data. Ann M. Knolhoff, Jerry A. Zweigenbaum, and Timothy R. Croley. Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.5b04208 • Publication Date (Web): 03 Mar 2016
Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Analytical Chemistry
Non-Targeted Screening of Food Matrices: Development of a Chemometric Software Strategy to Identify Unknowns in Liquid Chromatography-Mass Spectrometry Data
Ann M. Knolhoff1*, Jerry A. Zweigenbaum2, Timothy R. Croley1
1 U.S. Food and Drug Administration, Center for Food Safety and Applied Nutrition, 5100 Paint Branch Parkway, College Park, MD 20740
2 Agilent Technologies, Inc., 2850 Centerville Road, Wilmington, DE, 19808
*Corresponding Author: Ann M. Knolhoff. Tel: +1 240-402-2917; Fax: +301-436-2624; E-mail address: [email protected]
Abstract
The ability to identify contaminants or adulterants in diverse, complex sample matrices is necessary in food safety. Thus, non-targeted screening approaches must be implemented to detect and identify unexpected, unknown hazardous compounds that may be present. Molecular formulae can be generated for detected compounds from high-resolution mass spectrometry data, but analysis can be lengthy when thousands of compounds are detected in a single sample. Efficient data mining methods to analyze these complex data sets are necessary, given the inherent chemical diversity and variability of food matrices. The aim of this work is to determine necessary requirements to successfully apply data analysis strategies to distinguish suspect and control samples. Infant formula and orange juice samples were analyzed with one lot of each matrix containing varying concentrations of a four-compound mixture to represent a suspect sample set. Small molecular differences were parsed from the data, where analytes as low as 10 ppb were revealed. This was accomplished, in part, by analyzing a quality control standard, matrix spiked with an analytical standard mixture, technical replicates, a representative number of sample lots, and blanks within the sample sequence; this enabled the development of a data analysis workflow and ensured that the employed method is sufficient for mining relevant molecular features from the data.
Keywords
non-targeted screening, unknown analysis, chemometrics, food safety, LC/MS, HR-MS
Introduction
Chemical screening methodologies in food safety tend to focus on a specific list of compounds, such as pesticides or toxins; however, this approach can be limiting because adulterants or contaminants not included on a target list will not be identified. Developing approaches for non-targeted analysis is necessary in food safety to identify new and emerging risks. An accurate, high-throughput data analysis screening process for food matrices is needed, where the methodology could be applied to different commodities and compound types. An advantage to using liquid chromatography with mass spectrometry (LC/MS) is that thousands of compounds can be screened within a single sample, which is particularly useful when analyzing complex sample matrices. High-resolution mass spectrometry (HR-MS) enables sufficient mass accuracy for chemical formulae generation and is well equipped to resolve compounds with similar molecular weights.
A non-targeted LC/HR-MS data analysis strategy would be a multi-step process. First, eluting compounds need to be determined and extracted from the data set. Next, interpretation of the detected ions involves assignment of the monoisotopic peak and its m/z value, isotopic distribution analysis, and assignment of any potential adducts or losses that may be associated with the eluting compound. With sufficient mass accuracy and minimal isotopic distribution error, the correct molecular formula can be generated for ions of interest.1,2 These molecular formulae can then be searched against an established molecular database, where many compounds can be associated with a single molecular formula. Tandem mass spectrometry (MS/MS) can aid in determining the identity of the compound by dissociating ions of interest. In addition, if the compound is not present in a database, MS/MS may aid in predicting structural information.
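The adduct-assignment step described above can be illustrated with a minimal sketch. It is not the vendor algorithm: the adduct list, the 5 ppm tolerance, and the function names are assumptions made for illustration. The ion masses are standard monoisotopic values with the electron mass removed, and purine (C5H4N4) is used only because its protonated m/z is quoted later in the manuscript.

```python
# Monoisotopic masses (Da) of common positive-mode adduct ions, electron mass removed.
ADDUCT_ION_MASS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033826,
    "[M+Na]+": 22.989221,
    "[M+K]+": 38.963158,
}


def adduct_mz(neutral_mass):
    """Return the expected m/z of each singly charged adduct of a neutral mass."""
    return {name: neutral_mass + ion for name, ion in ADDUCT_ION_MASS.items()}


def assign_adduct(mz_protonated, mz_other, tol_ppm=5.0):
    """Name the adduct relating mz_other to a presumed [M+H]+ ion, or return None."""
    neutral = mz_protonated - ADDUCT_ION_MASS["[M+H]+"]
    for name, expected in adduct_mz(neutral).items():
        if name == "[M+H]+":
            continue
        if abs(mz_other - expected) / expected * 1e6 <= tol_ppm:
            return name
    return None


# Purine, C5H4N4, neutral monoisotopic mass 120.043596 Da.
purine = adduct_mz(120.043596)
print(round(purine["[M+H]+"], 4))              # 121.0509
print(assign_adduct(121.050872, 143.032817))   # [M+Na]+
```

Grouping ions this way lets several detected m/z values collapse into one candidate compound before formula generation.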
Data mining to identify unknown analytes can be challenging because thousands of compounds can be detected within a single food sample. One growing approach is foodomics, where the complete molecular content of a food matrix is characterized.3-6 Food databases are also being generated;7 building a database for a given commodity could be useful for commonly screened or adulterated samples8 because these intrinsic compounds can be removed from further analysis. However, many commodities require characterization, and databases will need to be updated when the molecular content of the commodity changes. Implementing a statistical analysis complements these approaches by focusing on identifying compounds that are different between sample sets, rather than identifying all compounds in a given sample type.
Statistical analyses, such as principal component analysis (PCA), are commonly used in biological applications for determining chemical differences between two or more different sample groups, such as control versus diseased states.9,10 This type of approach can be challenging when analyzing food matrices due to inherent sample complexity and diversity. A few reports have applied statistical classification to food samples using LC/MS data,11 including applications in adulteration,12 classifications based on region or type,13,14 and determining the presence of contaminants.15 Developing these types of data processing approaches will be beneficial in high-throughput screening of foods and other complex sample matrices.
The goal of this work is to determine necessary requirements for a data analysis workflow to distinguish molecular differences present in high, medium, and/or low abundance between control and suspect food samples. Compounds covering a large concentration range need to be parsed from the data to identify both highly abundant adulterants and low-level contaminants. For example, how do chemical complexity and lot-to-lot differences affect differentiation? What kinds of quality controls need to be implemented in this type of workflow? Experimental design and data processing factors that impact final results are determined, and suggestions are provided for successful method development for statistically analyzing suspect samples, with an emphasis on identifying adulterants and contaminants for food safety.
Materials and Methods
Sample Preparation
Three lots of milk-based infant formula (I.F.) and seven lots of pulp-free 100% orange juice (O.J.) were purchased from local grocery stores, each within the same brand of product but with different lot numbers. Five samples were prepared for each lot; 2 g of powdered formula or 2 mL of O.J. was extracted in 10 mL of acetonitrile in 15 mL polypropylene centrifuge tubes. These were rotated on a roller mixer (Stuart, Bibby Scientific, Staffordshire, UK) for 1 h at 33 rpm, centrifuged at 3900 rcf for 10 min, and filtered with 0.2 µm PTFE luer lock syringe filters (Grace, Deerfield, IL, USA).
A 1 ng/µL mixture of colchicine, hydrocodone, ricinine, and yohimbine (Table S1; Cerilliant, Round Rock, TX, USA) was prepared in 90/10 (v/v) water/acetonitrile. This analytical standard mixture was spiked into replicate sample extracts from one of the lots for each sample matrix (lots 1 and 6 for I.F. and O.J., respectively) for final concentrations of 10, 100, and 500 ppb (i.e., 5 replicates for each concentration). These represent the suspect sample groupings; samples from lots 1 and 6 were also prepared without the addition of the standard mixture, referred to as the unspiked matrix group. The control group of samples included lots 2 and 3 for I.F. and lots 1-5 and 7 for O.J. The different comparisons of sample groups are listed in Table S2, where the control group is compared against each spiked matrix concentration level, combined concentration levels, and against the unspiked matrix group. A 50 ppb standard mixture was also prepared as a quality control (QC) sample, as was a blank (water injection), which were analyzed periodically throughout the sample sequence. All solvents used were Optima Grade (Thermo Fisher Scientific, Pittsburgh, PA, USA).
Instrumentation
A 1290 Infinity LC was used in combination with a 6550 Q-TOF (Agilent Technologies, Santa Clara, CA). A dual electrospray ionization (ESI) source was used with the following parameters: 150 °C drying gas temperature, 19 L/min gas flow, 35 psig nebulizer, 350 °C sheath gas temperature, and 12 L/min sheath gas flow. The ESI voltage was 3.5 kV, the nozzle voltage was 0.5 kV, and the instrument operated in positive ion MS scan mode and monitored m/z 100-1000. Reference masses were used to internally calibrate the data and included protonated purine (m/z 121.0509) and protonated hexakis(1H,1H,3H-tetrafluoropropoxy)phosphazine (m/z 922.0098). The column was a Zorbax Eclipse Plus C18, 2.1 x 150 mm, 1.8 µm (Agilent Technologies). LC conditions included a sample injection volume of 5 µL, a 30 °C column temperature, and a 0.5 mL/min LC solvent flow rate with 0.1% formic acid (v/v) in water and acetonitrile as solvents A and B, respectively. The gradient was a 3 min hold at 95% A, a 17 min linear gradient to 10% A, a 5 min hold at 10% A, and a 5 min hold at 100% B.
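Mass accuracy against the two reference ions can be checked with a short calculation. The following is a minimal sketch, not part of the reported method: the 5 ppm tolerance, the dictionary keys, and the measured values are hypothetical and were chosen only to show one passing and one failing case.

```python
# Reference m/z values quoted in the text above.
REFERENCE_MZ = {
    "protonated purine": 121.0509,
    "protonated phosphazine reference": 922.0098,
}


def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def check_reference_masses(measured, max_abs_ppm=5.0):
    """Return {name: (ppm error, within tolerance?)} for each reference ion."""
    report = {}
    for name, theoretical in REFERENCE_MZ.items():
        err = ppm_error(measured[name], theoretical)
        report[name] = (err, abs(err) <= max_abs_ppm)
    return report


# Hypothetical measured values, invented for illustration only.
measured = {
    "protonated purine": 121.0511,
    "protonated phosphazine reference": 922.0150,
}
report = check_reference_masses(measured)
for name, (err, ok) in report.items():
    print(f"{name}: {err:+.2f} ppm, {'ok' if ok else 'out of tolerance'}")
```

Because the error is relative, the same absolute deviation in m/z produces a larger ppm error at low mass than at high mass, which is one reason to bracket the scan range with two reference ions.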
Data Analysis
The developed data analysis workflow is displayed in Figure 1. MassHunter Qualitative Analysis (Agilent, Version B.07.00) was used for the initial determination and interpretation of eluting compounds by using “Find by Molecular Feature”. The data analysis method included: 500 counts minimum ion peak height, compound filtering removed entities […]
[…] 4 out of 5 (80%) of each of the spiked matrix concentrations (outlined in green in Table S4). The quality score is calculated by an algorithm that considers the signal to noise, retention time consistency, peak shape and width, isotopic pattern, and mass difference between ions and their specified adducts. The data analysis method was also examined for the O.J. QC samples, where the “no peak limit” method was sufficient for extracting the analytes from the QC and spiked matrix samples. This emphasizes the importance of data analysis method testing and optimization for each data set and incorporating appropriate QC samples within the acquisition sequence.
We suggest analyzing not only QC replicates within a sample sequence, but also the sample matrix of interest spiked with the QC mixture at low, medium, and high concentrations; the analytical standard mixture should contain compounds within the retention time and mass range of interest. This provides an appropriate data set for optimizing or testing data processing methods and ensures that the chosen method is suitable for the collected data. Given the large number of molecular features that are detected in food matrices, it would be beneficial to increase the number of compounds in the QC standard mixture to ensure sufficient performance of the feature extraction method. Incorporating appropriate QC samples is also crucial in a statistical analysis workflow to minimize the effects of any experimental or instrumental variability that may be present due to performance differences over time or a change in LC solvents. For example, if the chemical background measured by the instrument changes, these molecular differences may erroneously influence sample type differentiation. Changes in sensitivity and mass accuracy can also be monitored with a QC sample and can indicate if different data processing settings are necessary. Similarly, data processing methods may also need to be modified for different sample matrices; this was the case for the I.F. and O.J. samples analyzed in this study. Furthermore, analyzing replicates (n>5/condition) yields higher confidence in the capabilities of both the instrument detection and data analysis processes because sample-to-sample variability can be monitored.
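One way to test a feature extraction method against QC or spiked-matrix replicates is to count how often each standard is recovered. The sketch below is an illustration under stated assumptions: the function name, the 10 ppm and 0.1 min matching tolerances, the retention time, and the replicate feature lists are all hypothetical. The target m/z is the protonated mass calculated for colchicine (C22H25NO6), which is a value supplied here rather than taken from the manuscript.

```python
def detection_rate(replicates, target_mz, target_rt, mz_tol_ppm=10.0, rt_tol_min=0.1):
    """Fraction of replicate feature lists that contain a feature matching the target.

    Each replicate is a list of (m/z, retention time in min) tuples.
    """
    hits = 0
    for features in replicates:
        if any(
            abs(mz - target_mz) / target_mz * 1e6 <= mz_tol_ppm
            and abs(rt - target_rt) <= rt_tol_min
            for mz, rt in features
        ):
            hits += 1
    return hits / len(replicates)


# Five hypothetical replicate injections; the target is missing from the fourth.
replicates = [
    [(400.1756, 10.02), (300.1000, 5.00)],
    [(400.1751, 10.01)],
    [(400.1760, 9.98)],
    [(250.2000, 3.30)],
    [(400.1753, 10.03)],
]
rate = detection_rate(replicates, target_mz=400.1755, target_rt=10.00)
print(rate)  # 0.8
```

A rate below the chosen requirement (for example 4 out of 5) would suggest revisiting the extraction settings before any statistical comparison is attempted.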
The extracted molecular features for I.F. and O.J. are displayed in Figure S1. While both food matrices are chemically complex, the O.J. contained more ions of high abundance than the I.F. The numbers of molecular features found in the control lots of I.F. and O.J. were approximately 1200 and 10600, respectively. Manually mining this data for chemical differences between sample groupings would be incredibly difficult and time consuming, as would identifying every eluting compound in either of these individual sample types.
Data Filtering
Filtering the data is feasible when multiple replicates have been analyzed for each sample. It also becomes necessary due to the inherent capabilities of molecular feature extraction and food matrix complexity. For example, searching for all of the detectable compounds in a given sample results in the extraction of many features that are present in only one sample. In the example shown in Figure S2, over 2000 features were each detected in only a single O.J. sample among all of the replicates and lots analyzed. Thus, requiring a feature to be present in at least 2 out of the total number of samples substantially reduces the number of features that will be considered. In the all spike and control O.J. comparison, the total number of features found in the sample groupings was reduced from 11516 to 9346.
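The minimum-occurrence filter described above can be sketched as follows. This is an illustration rather than the software's implementation; the data structure, feature identifiers, and sample names are hypothetical.

```python
def filter_by_sample_count(presence, min_samples=2):
    """Keep features detected in at least min_samples samples.

    presence maps a feature id to the set of sample names in which it was found.
    """
    return {
        fid: samples
        for fid, samples in presence.items()
        if len(samples) >= min_samples
    }


# Hypothetical presence table for three features.
presence = {
    "F1": {"OJ_lot1_rep1"},                                   # seen once, removed
    "F2": {"OJ_lot1_rep1", "OJ_lot2_rep3"},
    "F3": {"OJ_lot1_rep1", "OJ_lot1_rep2", "OJ_lot6_rep5"},
}
kept = filter_by_sample_count(presence)
print(sorted(kept))      # ['F2', 'F3']

# Reduction reported in the text for the all spike and control O.J. comparison.
print(11516 - 9346)      # 2170 features removed
```

The difference of 2170 features is consistent with the statement that over 2000 features were found in only a single sample.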
Because the optimized data analysis method was successful in reproducibly extracting the same features in 80% of the QC and spiked matrices (Table S4), features were required to be present within 60% of samples within a given sample set to further filter the data. This more conservative setting was used to accommodate any instances of inconsistent feature extraction, which may be due to the software algorithm or sample matrix variability. In the spiked (all concentrations) and control O.J. comparison, this reduced the number of features by nearly 4600. This filtering ensures that only reproducibly detected and extracted features are retained for subsequent analysis.
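A frequency filter of this kind can be sketched as below. The interpretation that a feature is retained when it reaches the 60% requirement in at least one sample group is an assumption of this sketch, as are the function name, group labels, and toy data.

```python
def filter_by_group_frequency(presence, groups, min_fraction=0.6):
    """Keep features found in at least min_fraction of the samples of any one group.

    presence maps a feature id to the set of samples in which it was found;
    groups maps a group name to the list of its sample names.
    """
    kept = []
    for fid, samples in presence.items():
        for members in groups.values():
            found = sum(1 for s in members if s in samples)
            if found / len(members) >= min_fraction:
                kept.append(fid)
                break
    return kept


groups = {
    "control": ["c1", "c2", "c3", "c4", "c5"],
    "spiked": ["s1", "s2", "s3", "s4", "s5"],
}
presence = {
    "A": {"s1", "s2", "s3"},              # 3 of 5 spiked samples, kept
    "B": {"c1", "s1"},                    # 1 of 5 in each group, removed
    "C": {"c1", "c2", "c3", "c4", "c5"},  # all control samples, kept
}
kept = filter_by_group_frequency(presence, groups)
print(kept)  # ['A', 'C']
```

Evaluating the requirement per group rather than across all samples keeps a compound that is present only in the suspect group, which is the case of interest for adulterants.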
Visualizing Molecular Differences Between Suspect and Control Groups and Between Sample Lots
After data filtering, the remaining features were analyzed by PCA; the plots are displayed in Figure 2. PCA enables visualization of molecular similarities and differences based on the proximity of samples to one another. For both I.F. and O.J., differentiation of the combined spiked matrices and the control can be observed. It was expected that the lowest concentration of the spiked analytical standard, 10 ppb, would be the most difficult to distinguish between the two sample groupings. The 10 ppb spiked I.F. and control groups are clearly distinguished from one another; however, this is not obvious in the 10 ppb versus control O.J. comparison.
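How PCA separates sample groups can be shown with a minimal, standard-library sketch that finds the first principal component by power iteration on the covariance matrix. This is not the procedure used by the statistical software in the study; the function name and the small intensity table (three features, with the third elevated in the spiked samples) are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for PC1 of mean-centered data via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project each centered sample onto the first component.
    scores = [sum(c[j] * vec[j] for j in range(p)) for c in centered]
    return vec, scores


# Hypothetical log-scale abundances: rows are samples, columns are features.
control = [[5.0, 3.0, 0.1], [5.2, 2.9, 0.0], [4.9, 3.1, 0.2]]
spiked = [[5.1, 3.0, 4.0], [5.0, 3.1, 4.2], [5.1, 2.9, 3.9]]
loadings, scores = first_principal_component(control + spiked)
control_scores, spiked_scores = scores[:3], scores[3:]
separated = (max(control_scores) < min(spiked_scores)
             or min(control_scores) > max(spiked_scores))
print(separated)  # True
```

In this toy case the third feature dominates the loadings, so samples separate along PC1 according to whether that feature is present; in real food matrices, lot-to-lot variability contributes competing sources of variance.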
We also investigated the inherent molecular differences between lots within the same brand of food matrix. Lot 1 and lot 6 from the I.F. and O.J. samples, respectively, were the lots spiked with the analytical standard mixture. These lots were also analyzed without the addition of the standard. By comparing these unadulterated lots to their respective control groups, any inherent molecular differences of that lot can be determined; these features would additionally contribute to any observed chemical differences between the spiked and control comparisons. Interestingly, lot 1 (I.F.) versus the control group does show some separation in the PCA plot; however, when the three lots are compared independently, this differentiation is not observed. It is expected that I.F. should be fairly consistent between different lots within the same brand. In contrast, lot 6 (O.J.) and the control group are not clearly differentiated from one another, similar to the 10 ppb and control comparison for O.J. However, in the individual lot comparison of O.J., lots 7 and 6 cluster separately from the others, where lot 6 is grouped between lot 7 and the remaining lots, indicating that it shares qualities of both sample groupings. This is likely why the 10 ppb samples do not appear to be in their own grouping in that comparison. Interestingly, the O.J. lots display distinguishable lot-to-lot variability, despite being purchased from the same brand and on the same date. This is not surprising given that the molecular content may be influenced by weather conditions, species of orange, and/or growth location, amongst other variables. Analyzing data in this manner can also serve to identify any samples or sample lots that may be outliers, which can be excluded in further analyses.
Statistical Analysis: T-Test and Fold Change
An unpaired t-test was performed for each of the molecular features after data filtering. Multiple comparisons of the control (unadulterated food matrix extracts) and suspect groups (lots with analytical standard) included the combined spiked matrices and the 10, 100, and 500 ppb spikes (listed in Table S2). Additionally, unadulterated lots 1 and 6, from I.F. and O.J., respectively, were compared against their matrix control groups. As expected, the number of molecular features found to be statistically different between sample groupings decreases with a decrease in p-value threshold (FDR level), which is illustrated in Figure S3A. However, if a large number of features are found to be statistically significant between sample groupings, a certain number will be expected by chance; this is reflected in Figure S3B. For example, because a large number (>450) of features was found to be statistically different between the 500 ppb and control O.J. groups, approximately 25 of these features would be expected by chance with an adjusted p-value of 0.05. To reduce the number of potential molecular features, the p-value threshold for each comparison was chosen where zero features are expected by chance, which is automatically calculated within MPP (Table S3). Alternatively, a smaller p-value threshold (e.g., 0.001) could be consistently chosen when a large number of distinguishing features are found, but at the risk of removing compounds of concern from the data set. Here, a judgment call needs to be made to determine an appropriate threshold based on the number of potential molecular feature candidates that differentiate sample groupings.
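The relationship between the FDR threshold and the number of features expected by chance can be sketched briefly. The text does not state which multiple-testing correction was applied, so the Benjamini-Hochberg procedure used here is an assumption, as are the function names; the count of 500 significant features is an illustrative value consistent with the ">450" quoted above.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def expected_by_chance(n_significant, fdr_threshold):
    """Approximate number of false discoveries among the significant features."""
    return n_significant * fdr_threshold


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(p, 3) for p in adjusted])   # [0.02, 0.04, 0.04, 0.02]
print(expected_by_chance(500, 0.05))     # 25.0
print(expected_by_chance(500, 0.001))    # 0.5
```

Lowering the threshold until the expected count falls below one corresponds to the "zero features expected by chance" criterion, at the cost of discarding weaker but possibly real differences.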
Molecular features were then limited to ones that exhibited a 2-fold increase compared to the control group. The total number of features shown to initially distinguish the spike and control groups is displayed in Figure 3 (indicated in blue). This specific compound list was then searched again in the raw data files. Finding the presence of specific compounds is generally more definitive compared to the initial feature extraction because it searches the same m/z value within a specified retention time window for all data files rather than generically searching for eluting compounds. Because of this, the repeat filtering step was chosen to be more restrictive than the initial filtering: a feature needed to be observed in 80% of the samples in a sample grouping to be included in the subsequent statistical comparison.
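The fold-change criterion can be sketched as a ratio of group means. This is a simplified illustration: the use of arithmetic means, the floor that stands in for a feature absent from the control, and the abundance values are assumptions of this sketch and are not taken from the study.

```python
from statistics import mean


def fold_change(suspect, control, floor=1.0):
    """Ratio of mean suspect abundance to mean control abundance.

    floor replaces a zero control mean so that features absent from the
    control group do not cause a division by zero.
    """
    return mean(suspect) / max(mean(control), floor)


def passes_fold_change(suspect, control, min_fold=2.0):
    """True when the suspect group shows at least a min_fold increase."""
    return fold_change(suspect, control) >= min_fold


# Hypothetical peak abundances for five replicates per group.
increased = passes_fold_change([1000.0, 1200.0, 1100.0, 900.0, 1050.0],
                               [400.0, 500.0, 450.0, 420.0, 480.0])
unchanged = passes_fold_change([500.0, 480.0, 510.0, 495.0, 505.0],
                               [450.0, 470.0, 460.0, 455.0, 465.0])
absent_in_control = passes_fold_change([800.0, 750.0, 820.0, 790.0, 810.0],
                                       [0.0, 0.0, 0.0, 0.0, 0.0])
print(increased, unchanged, absent_in_control)  # True False True
```

Only increases are retained because an adulterant or contaminant is expected to be more abundant in the suspect group than in the control.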
The statistical analysis was then repeated on the extracted molecular features using this specified search. The number of features that distinguish sample groupings decreases after the repeat analysis, demonstrating that this recursive data treatment was necessary (indicated in red, Figure 3). This was likely due to features that were not reproducibly detected and/or extracted from the data set. An automated recursion process would remove some of these manual processing steps, which some vendors are beginning to develop (e.g., Agilent’s Profinder). In all comparisons of the control and standard spike groupings, including the 10 ppb and control comparison, the four compounds in the analytical standard mixture were able to be parsed from the data, which demonstrates the utility of this developed data analysis workflow.
Comparing the spiked matrices to a control group ensured the statistical data analysis methodology was sufficient for different concentrations of compounds present in two very different sample matrices. If spiked matrices are incorporated as a QC step, this data can also be used to test and optimize the statistical data analysis workflow. Different processing parameters or statistical tests may need to be used with different sample matrices or when greater molecular differences are expected due to matrix complexity or variability within a matrix type.
In the I.F. comparisons, most of the t-test sets resulted in approximately 20 molecular features, which is a manageable number to attempt to identify (indicated in red, Figure 3). However, in the 500 ppb I.F. comparison, more than 60 distinguishing compounds were found. Limiting the number of compounds of interest is essential because identification of even a single compound can be lengthy, but this must be done without removing potentially harmful compounds from the data set. In the lot 1 versus control I.F. comparison, fewer than 10 compounds were found to distinguish these sample groups, which means it is likely that many of the features detected in the other comparisons are associated with the standard mixture. This is also indicated by an increase in the number of features with the amount of standard added (Figure 3).
This same trend can be observed in the O.J. sample comparisons, where 100 compounds are found to distinguish the 500 ppb level from the control (indicated in red, Figure 3). The resultant feature lists for the 500 ppb spike and control group comparisons for both I.F. and O.J. were evaluated against features found in the blanks and in their respective unspiked lots (lots 1 and 6 for I.F. and O.J., respectively). None of the molecular features were found in the blank. There was no overlap for I.F., while 3 of the 99 features were intrinsic to lot 6 (Figure S4). Therefore, the majority of features are associated with the addition of the standard mixture. Thus, related molecular features can be removed if they coelute with another compound that exhibited greater ion intensity. An example of this process is shown in Table S5. Compounds within a 0.02 min retention time window were manually binned together, with the most abundant compound being submitted for further analysis. As indicated in green in Figure 3, this decreases the number of molecular features by more than half. While this optional strategy may eliminate features of interest, it was employed here to reduce analysis time.
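The manual binning of coeluting features could be automated along the following lines. This is a sketch under stated assumptions: the rule that a bin spans 0.02 min from its first feature is one possible reading of the retention time window, and the function name and feature table are hypothetical.

```python
def bin_coeluting(features, rt_window=0.02):
    """Group features by retention time and keep the most abundant one per bin.

    features is a list of (feature_id, rt_min, abundance) tuples. A new bin starts
    when a feature elutes more than rt_window after the first feature of the
    current bin.
    """
    kept = []
    current = []
    for feat in sorted(features, key=lambda f: f[1]):
        if current and feat[1] - current[0][1] > rt_window:
            kept.append(max(current, key=lambda f: f[2]))
            current = []
        current.append(feat)
    if current:
        kept.append(max(current, key=lambda f: f[2]))
    return kept


# Hypothetical features: two groups of coeluting ions.
features = [
    ("a", 5.000, 1.0e5),
    ("b", 5.010, 3.0e4),
    ("c", 5.015, 2.0e4),
    ("d", 7.500, 8.0e4),
    ("e", 7.510, 9.0e4),
]
representatives = bin_coeluting(features)
print([f[0] for f in representatives])  # ['a', 'e']
```

As noted in the text, this reduction can discard a genuine compound that happens to coelute with a more abundant one, so the discarded features are worth retaining in a secondary list.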
Molecular Formulae Generation and Database Searching
Molecular formulae were generated for statistically different compounds between the control and suspect comparisons. The correct molecular formulae were generated for the compounds in the standard mixture and many of these formulae were the top scoring result; however, there were examples that were not. In the 10, 100, and 500 ppb comparisons in I.F., the molecular formula of colchicine was not the top formula match for that detected compound. In the 500 ppb comparison, there were three molecular formulae generated, each with scores greater than 97 on a scale of 100. The molecular formula score evaluates metrics including the signal-to-noise, retention time, chromatographic peak width and shape, isotope pattern, and the mass difference amongst related ions such as dimers, trimers, and other adducts. The score reflects the probability that the feature is a real compound, with a score of 100 being a perfect fit. It is not obvious why the correct formula was not the top hit because the measured mass accuracy compared to the colchicine molecular formula is actually better than that of the top scoring formula. The relative isotopic distributions for both molecular formulae are also similar. For the O.J. comparisons, an incorrect molecular formula was generated as the top match instead of the formula for ricinine in the all spike, 100 ppb, and 500 ppb comparisons. In the 500 ppb O.J. comparison example, the score for the generated molecular formula of ricinine was 97.6, but the mass accuracy error compared against this compound was worse than the value reported for the top matching molecular formula (3.3 ppm). This example emphasizes the need for good data quality; high mass accuracy and accurately measured isotopic ratios can aid in generating the correct molecular formula.2 Furthermore, Q-TOF instruments can be susceptible to higher mass accuracy error at high peak abundance due to saturation;18 however, the m/z value reported was taken below the 30% saturation level (MassHunter indicates saturated compounds) from the peak shoulders to minimize this effect. Because the best generated molecular formula may not correspond to the detected compound, multiple molecular formulae may need to be considered, where a cutoff score could be implemented.
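Considering several candidate formulae above a cutoff score, rather than only the top hit, can be sketched as follows. Everything numeric in the example is hypothetical: the measured m/z, the scores, the cutoff of 90, and the two alternative formulae were invented for illustration and do not come from the study. The colchicine formula C22H25NO6 is a known value supplied here, and the parser handles only C, H, N, and O.

```python
import re

# Monoisotopic masses (Da) for the elements handled by this sketch.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
PROTON = 1.00727646


def protonated_mz(formula):
    """[M+H]+ m/z for a formula containing only C, H, N, and O, e.g. 'C22H25NO6'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass + PROTON


def rank_candidates(measured_mz, candidates, min_score=90.0):
    """Keep candidates scoring at least min_score, sorted by absolute ppm error."""
    kept = []
    for formula, score in candidates:
        if score >= min_score:
            theoretical = protonated_mz(formula)
            ppm = (measured_mz - theoretical) / theoretical * 1e6
            kept.append((formula, score, ppm))
    return sorted(kept, key=lambda c: abs(c[2]))


# Hypothetical candidate list: (formula, software score).
candidates = [("C23H21N5O2", 98.5), ("C22H25NO6", 97.9), ("C19H27NO8", 62.0)]
ranked = rank_candidates(400.1758, candidates)
for formula, score, ppm in ranked:
    print(f"{formula}: score {score}, {ppm:+.2f} ppm")
```

In this constructed case the formula with the highest score is not the one with the smallest mass error, which mirrors the situation described above and shows why a cutoff score is safer than accepting only the top match.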

There are multiple small molecule compound databases that are continuously being created or updated. These databases are often not all-encompassing, so multiple databases should be searched. Some vendor software can automatically search against certain molecular databases, but the capability to automatically link to multiple databases within the chosen analysis software, rather than manually searching available databases, would be beneficial. The reduced molecular formulae list generated by comparing 10 ppb and control O.J. groups was manually searched against four commonly used online databases (Table 1). The number of compounds that match a given molecular formula varies widely, which emphasizes the importance of a manageable number of compounds to identify. If the suspect compound has been studied previously, prioritizing the compounds by the number of references associated with each can be useful and can be accomplished within ChemSpider.19 However, if the compound is an emerging risk, it may not be well characterized. SciFinder also allows for prioritizing compounds, where the list of compounds can be reduced by a variety of properties including toxicity. This could be particularly useful if the molecular species is known to cause an illness. However, if the compound is truly an unknown, searching against a molecular database will be insufficient for identification and MS/MS approaches will be required for structure elucidation. Likewise, MS/MS analysis would be necessary to confirm a database assignment and to aid in identification when no molecular formula is generated, as is the case for two compounds in Table S5.
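
The prioritization step above can be sketched as follows, assuming the hits for one molecular formula have already been exported from each database by hand. The database names, compound names, and reference counts are hypothetical placeholders, and no real database API is called.

```python
def merge_and_rank(hits_by_database):
    """Merge per-database hits and rank compounds by reference count.

    hits_by_database maps a database name to a list of
    (compound_name, reference_count) pairs.
    """
    merged = {}
    for database, hits in hits_by_database.items():
        for name, n_refs in hits:
            entry = merged.setdefault(name, {"references": 0, "databases": []})
            # Keep the largest reference count reported by any database.
            entry["references"] = max(entry["references"], n_refs)
            entry["databases"].append(database)
    return sorted(merged.items(),
                  key=lambda kv: kv[1]["references"],
                  reverse=True)


# Hypothetical exported hits for a single molecular formula.
hits_by_database = {
    "database_1": [("compound_X", 120), ("compound_Y", 3)],
    "database_2": [("compound_Y", 5), ("compound_Z", 0)],
}
ranked = merge_and_rank(hits_by_database)
```

As the text notes, a ranking of this kind favors well-studied compounds, so an emerging risk with few references would sink to the bottom of the list.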

Searching the Metlin database revealed two metabolites of colchicine (gloriosine and desmethylcolchicine) that match molecular formula C21H23NO6 (m/z 385) listed in Table 1; these two compounds were also present as impurities in the standard solution. One of the compounds matching this molecular formula was removed from the list in Table S5 because it coeluted with colchicine. The MS/MS fragmentation listed in the Metlin database indicates that this is not a product ion of colchicine, which implies that the similar chemical structure of this metabolite causes its coelution. Furthermore, extracted ion chromatograms of colchicine (m/z 400) and its putative metabolite (m/z 385) do not completely overlay, which further supports that m/z 386 is not a product ion of colchicine (Figure S5). The two metabolites of colchicine were found in all of the comparisons for both I.F. and O.J., which is additional confirmation of the data analysis workflow functionality.
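
One way to put a number on how well two extracted ion chromatograms overlay is a Pearson correlation of their intensity traces across the same scans. The sketch below is an illustrative assumption, not the procedure used for Figure S5, and the intensity values are hypothetical. An in-source product ion should track its precursor scan by scan (correlation near 1), whereas a separate, partially coeluting compound should not.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length traces."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical intensity traces sampled at the same scan times.
parent = [0, 10, 50, 100, 50, 10, 0]
fragment_like = [0, 1, 5, 10, 5, 1, 0]   # same profile at lower abundance
offset_peak = [0, 0, 10, 50, 100, 50, 10]  # apex one scan later

r_fragment = pearson(parent, fragment_like)
r_offset = pearson(parent, offset_peak)
```

Any threshold used to separate the two cases would need to be set from QC data for the method in question.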

Conclusions

Validating method performance is critical in ensuring that hazardous compounds will be parsed from the data set. As mentioned previously, analyzing both QC standards and QC matrix samples is beneficial in determining instrumental platform and data analysis performance. Data processing settings are not universal for all matrices and data sets. QC matrix samples provide the ability to adjust those settings, including those for feature extraction, so that true unknowns have the highest probability of being determined in complex food samples. Additionally, QC samples can be used to determine if the chromatographic method resolves eluting peaks and results in consistent retention times and mass accuracy, which will enable identical compounds detected in multiple analyses to be properly binned. Irregular chromatographic peak shapes will also make feature comparison more challenging. If coelution or poor peak shapes affect the feature extraction of QC compounds, the chromatography should be optimized. It is also worth noting that sufficient sample preparation, in addition to chromatography, is required to ensure that compounds are detected from samples of interest, although this was outside the scope of this work.
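
The dependence of binning on consistent retention times and mass accuracy can be illustrated with a small sketch. It groups (m/z, retention time) features from several analyses when both values agree within a tolerance. The greedy strategy, the tolerances of 10 ppm and 0.1 min, and the feature values are assumptions chosen for illustration; they are not the settings or algorithm of the software used in this work.

```python
def same_feature(a, b, ppm_tol=10.0, rt_tol=0.1):
    """True if two (mz, rt) features agree within both tolerances."""
    ppm = abs(a[0] - b[0]) / b[0] * 1e6
    return ppm <= ppm_tol and abs(a[1] - b[1]) <= rt_tol


def bin_features(features, ppm_tol=10.0, rt_tol=0.1):
    """Greedy binning: a feature joins the first bin whose first member it matches."""
    bins = []
    for feat in sorted(features):
        for members in bins:
            if same_feature(feat, members[0], ppm_tol, rt_tol):
                members.append(feat)
                break
        else:
            # No existing bin matched, so start a new one.
            bins.append([feat])
    return bins


# Hypothetical (m/z, retention time in min) features from several runs.
features = [
    (400.1760, 5.02),
    (400.1764, 5.05),  # same compound, small m/z and RT drift
    (400.1762, 6.40),  # same m/z but a different retention time
    (386.1600, 4.80),
]
bins = bin_features(features)
```

If retention time drift between analyses exceeded the tolerance, the first two features would be split into separate bins and treated as different compounds, which is why QC samples are used to check consistency first.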

Analyzing blanks within the sample sequence is also necessary. This was not critical in the present study because the samples were analyzed by the instrument within the same week. If collected data need to be compared with data analyzed in previous months or years, the chemical background of the instrument platform may not be identical; this can also occur on a much smaller timescale (a few days) and may lead to false positives. The collected data from blank injections can be used to ensure that differentiating compounds are not from the chemical background of the system. Similarly, analyzing the samples in random order within the acquisition sequence will reduce this potential source of error. Incorporating an internal standard into each of the samples can also serve as a normalization factor to account for any instrumental performance differences.
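
The three safeguards in this paragraph (randomized injection order, blank comparison, and internal standard normalization) can be sketched in a few lines. The function names, the fixed random seed, and the threefold blank ratio are hypothetical choices made for illustration and were not part of this study.

```python
import random


def randomized_sequence(sample_names, seed=0):
    """Return the samples in a reproducible random injection order."""
    order = list(sample_names)
    random.Random(seed).shuffle(order)
    return order


def normalize_to_internal_standard(abundances, is_key):
    """Divide each feature abundance by the internal standard abundance."""
    is_abundance = abundances[is_key]
    return {k: v / is_abundance for k, v in abundances.items() if k != is_key}


def not_in_blank(sample, blank, min_ratio=3.0):
    """Keep features whose abundance exceeds min_ratio times the blank signal."""
    return {k: v for k, v in sample.items()
            if v > min_ratio * blank.get(k, 0.0)}


# Hypothetical feature abundances keyed by feature label.
sample = {"IS": 200.0, "feature_1": 100.0, "feature_2": 10.0}
blank = {"feature_2": 8.0}

run_order = randomized_sequence(["control_1", "control_2", "spiked_1", "spiked_2"])
normalized = normalize_to_internal_standard(sample, "IS")
retained = not_in_blank({k: v for k, v in sample.items() if k != "IS"}, blank)
```

Here feature_2 is dropped because its signal is comparable to the blank, so it would not be reported as a differentiating compound.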

Sample matrices where low lot-to-lot variability or a lower number of molecular features are expected will be easier to compare, especially between months, and even years. Despite the O.J. being purchased on the same date and from the same brand, some of the lots were different from the others (Figure 2). This emphasizes the need to analyze a sufficiently large number of replicates and lots for a particular sample type to ensure a representative sample grouping. It is expected that the molecular profile of O.J. could vary depending on the brand and blend, where it is grown, the types of oranges used, and the climate. If there are inherent differences among the samples being analyzed, there will be a larger number of statistically relevant molecular features in addition to any adulterants/contaminants present, so additionally identifying these compounds will increase analysis time. However, this is still an improvement compared to identifying all compounds within a given sample. For sample types where larger intralot variability is expected, a database of compounds that are common could be generated, which could also reduce analysis time. Of course, these samples would need to be void of any potential hazards.
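
A database of common compounds of the kind suggested above could be as simple as a set of features seen in most hazard-free lots, which is then subtracted from the list of differentiating features. The sketch below assumes features are represented by labels, and the requirement of four out of five lots is a hypothetical threshold.

```python
from collections import Counter


def build_common_set(lots, min_lots):
    """Features present in at least min_lots of the hazard-free lots."""
    counts = Counter(feature for lot in lots for feature in set(lot))
    return {feature for feature, n in counts.items() if n >= min_lots}


def remove_common(features, common):
    """Drop features already catalogued as common to the sample type."""
    return [f for f in features if f not in common]


# Hypothetical feature labels detected in five hazard-free lots.
lots = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "D"},
    {"A", "C"},
    {"A", "B"},
]
common = build_common_set(lots, min_lots=4)

# Hypothetical differentiating features from a suspect sample.
remaining = remove_common(["A", "X", "B", "Y"], common)
```

Only the features left after this subtraction would need to be carried forward to formula generation and database searching.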

While further advancement is needed to improve high-throughput identification, current software tools are sufficient to detect molecular differences in food matrices, in spite of chemical complexity. Statistical comparisons can be successful if appropriate quality controls are implemented and if adequate sampling accounts for potential molecular variation within the same sample type. This developed data analysis workflow can be used as a model for statistical elucidation of compounds present in suspect food samples.

Abbreviations Used

LC/MS   liquid chromatography coupled to mass spectrometry
HR-MS   high-resolution mass spectrometry
MS/MS   tandem mass spectrometry
I.F.    infant formula
O.J.    orange juice
QC      quality control
MPP     Mass Profiler Professional
PCA     principal component analysis

Acknowledgments

The authors would like to thank John Ihrie (FDA) for insightful discussions concerning the statistical treatment of the data.

Supporting Information

Supporting figures (Figures S1-S5) and tables (Tables S1-S5) as noted in the text.

References

(1) Knolhoff, A. M.; Callahan, J. H.; Croley, T. R. J. Am. Soc. Mass Spectrom. 2014, 25, 1285-1294.
(2) Kind, T.; Fiehn, O. BMC Bioinformatics 2007, 8, 105.
(3) Castro-Puyana, M.; Herrero, M. TrAC-Trend. Anal. Chem. 2013, 52, 74-87.
(4) García-Cañas, V.; Simó, C.; Herrero, M.; Ibáñez, E.; Cifuentes, A. Anal. Chem. 2012.
(5) Herrero, M.; Simó, C.; García-Cañas, V.; Ibáñez, E.; Cifuentes, A. Mass Spectrom. Rev. 2012, 31, 49-69.
(6) Hu, C.; Xu, G. TrAC-Trend. Anal. Chem. 2013, 52, 36-46.
(7) Scalbert, A.; Andres-Lacueva, C.; Arita, M.; Kroon, P.; Manach, C.; Urpi-Sarda, M.; Wishart, D. J. Agr. Food Chem. 2011, 59, 4331-4348.
(8) Moore, J. C.; Spink, J.; Lipp, M. J. Food Sci. 2012, 77, R118-R126.
(9) Pan, Z.; Gu, H.; Talaty, N.; Chen, H.; Shanaiah, N.; Hainline, B.; Cooks, R. G.; Raftery, D. Anal. Bioanal. Chem. 2007, 387, 539-549.
(10) Wang, C.; Kong, H.; Guan, Y.; Yang, J.; Gu, J.; Yang, S.; Xu, G. Anal. Chem. 2005, 77, 4108-4116.
(11) Knolhoff, A. M.; Croley, T. R. J. Chrom. A, DOI: http://dx.doi.org/10.1016/j.chroma.2015.08.059.
(12) Vaclavik, L.; Schreiber, A.; Lacina, O.; Cajka, T.; Hajslova, J. Metabolomics 2012, 8, 793-803.
(13) Cotton, J.; Leroux, F.; Broudin, S.; Marie, M.; Corman, B.; Tabet, J.-C.; Ducruix, C.; Junot, C. J. Agr. Food Chem. 2014, 62, 11335-11345.
(14) Vaclavik, L.; Lacina, O.; Hajslova, J.; Zweigenbaum, J. Anal. Chim. Acta 2011, 685, 45-51.
(15) Tengstrand, E.; Rosén, J.; Hellenäs, K.-E.; Åberg, K. M. Anal. Bioanal. Chem. 2013, 405, 1237-1243.
(16) Bolton, E.; Wang, Y.; Thiessen, P. A.; Bryant, S. H. In Annu. Rep. Comput. Chem.; Elsevier: Oxford, UK, 2008, pp 217-240.
(17) Smith, C. A.; O'Maille, G.; Want, E. J.; Qin, C.; Trauger, S. A.; Brandon, T. R.; Custodio, D. E.; Abagyan, R.; Siuzdak, G. Ther. Drug Monit. 2005, 27, 747-751.
(18) Bristow, T.; Constantine, J.; Harrison, M.; Cavoit, F. Rapid Commun. Mass Spectrom. 2008, 22, 1213-1222.
(19) Little, J.; Williams, A.; Pshenichnov, A.; Tkachenko, V. J. Am. Soc. Mass Spectrom. 2011, 23, 179-185.

Tables

Table 1. Number of compounds found for generated molecular formulae for differentiating analytes from the 10 ppb spiked and control orange juice groups. The molecular formulae from the standard mixture are outlined in green.

Generated Molecular Formula   ChemSpider   SciFinder   PubChem   Metlin
C8H8N2O2                      515          1414        1         1
C15H30O2S2                    1            36          0         0
C18H21NO3                     3974         9675        15        13
C16H10N7O3                    2            0           0         0
C17H25NO10                    21           129         1         0
C21H26N2O3                    6984         11430       3         9
C29H47N5OS4                   0            0           0         0
C21H23NO6                     1786         2414        0         8
C22H25NO6                     1721         2301        2         3
C22H22O9                      114          360         1         17

Figure Graphics

Figure 1. Data analysis workflow for distinguishing suspect and control groups. A. Overall data processing to aid in identifying unknown compounds using a statistical approach. B. Procedure implemented for data filtering and statistical analysis.
Figure 2. PCA of selected comparisons of sample groupings. “Spiked vs Control” compares the unadulterated lots with all the matrices spiked with different concentrations of analytical standard, while “10 ppb vs Control” includes unadulterated lots and only the 10 ppb spiked matrices. “Lot 1 vs Control” and “Lot 6 vs Control” enabled any molecular differences inherent to Lot 1 or 6 to be monitored without the contribution of the analytical standard. The unadulterated lots are also compared in “Lot Comparison” to observe the molecular differences present between lots of the same brand of sample matrix.
Figure 3. Number of molecular features found to differ between sample groupings for different stages of the data analysis process.
Graphic for Table of Contents