Article

Non-Targeted Screening of Food Matrices: Development of a Chemometric Software Strategy to Identify Unknowns in Liquid Chromatography-Mass Spectrometry Data

Ann M. Knolhoff, Jerry A Zweigenbaum, and Timothy R Croley

Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.5b04208 • Publication Date (Web): 03 Mar 2016

Downloaded from http://pubs.acs.org on March 6, 2016

Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Analytical Chemistry

Non-Targeted Screening of Food Matrices: Development of a Chemometric Software Strategy to Identify Unknowns in Liquid Chromatography-Mass Spectrometry Data

Ann M. Knolhoff¹*, Jerry A. Zweigenbaum², Timothy R. Croley¹

¹ U.S. Food and Drug Administration, Center for Food Safety and Applied Nutrition, 5100 Paint Branch Parkway, College Park, MD 20740

² Agilent Technologies, Inc., 2850 Centerville Road, Wilmington, DE, 19808

*Corresponding Author: Ann M. Knolhoff. Tel: +1 240-402-2917; Fax: +301-436-2624; E-mail address: [email protected]

Abstract

The ability to identify contaminants or adulterants in diverse, complex sample matrices is necessary in food safety. Thus, non-targeted screening approaches must be implemented to detect and identify unexpected, unknown hazardous compounds that may be present. Molecular formulae can be generated for detected compounds from high-resolution mass spectrometry data, but analysis can be lengthy when thousands of compounds are detected in a single sample. Efficient data mining methods to analyze these complex data sets are necessary, given the inherent chemical diversity and variability of food matrices. The aim of this work is to determine necessary requirements to successfully apply data analysis strategies to distinguish suspect and control samples. Infant formula and orange juice samples were analyzed, with one lot of each matrix containing varying concentrations of a four-compound mixture to represent a suspect sample set. Small molecular differences were parsed from the data, where analytes as low as 10 ppb were revealed. This was accomplished, in part, by analyzing a quality control standard, matrix spiked with an analytical standard mixture, technical replicates, a representative number of sample lots, and blanks within the sample sequence; this enabled the development of a data analysis workflow and ensured that the employed method is sufficient for mining relevant molecular features from the data.

Keywords

non-targeted screening, unknown analysis, chemometrics, food safety, LC/MS, HR-MS

Introduction

Chemical screening methodologies in food safety tend to focus on a specific list of compounds, such as pesticides or toxins; however, this approach can be limiting because adulterants or contaminants not included on a target list will not be identified. Developing approaches for non-targeted analysis is necessary in food safety to identify new and emerging risks. An accurate, high-throughput data analysis screening process for food matrices is needed, where the methodology could be applied to different commodities and compound types. An advantage of using liquid chromatography with mass spectrometry (LC/MS) is that thousands of compounds can be screened within a single sample, which is particularly useful when analyzing complex sample matrices. High-resolution mass spectrometry (HR-MS) enables sufficient mass accuracy for chemical formulae generation and is well equipped to resolve compounds with similar molecular weights.

A non-targeted LC/HR-MS data analysis strategy would be a multi-step process. First, eluting compounds need to be determined and extracted from the data set. Next, interpretation of the detected ions involves assigning the monoisotopic peak and its m/z value, analyzing the isotopic distribution, and assigning any potential adducts or losses that may be associated with the eluting compound. With sufficient mass accuracy and minimal isotopic distribution error, the correct molecular formula can be generated for ions of interest.¹,² These molecular formulae can then be searched against an established molecular database where many compounds can be associated with a single molecular formula. Tandem mass spectrometry (MS/MS) can aid in determining the identity of the compound by dissociating ions of interest. In addition, if the compound is not present in a database, MS/MS may aid in predicting structural information.

Data mining to identify unknown analytes can be challenging because thousands of compounds can be detected within a single food sample. One growing approach is foodomics, where the complete molecular content of a food matrix is characterized.³⁻⁶ Food databases are also being generated;⁷ building a database for a given commodity could be useful for commonly screened or adulterated samples⁸ because these intrinsic compounds can be removed from further analysis. However, many commodities require characterization and databases will need to be updated when the molecular content of the commodity changes. Implementing a statistical analysis complements these approaches by focusing on identifying compounds that are different between sample sets, rather than identifying all compounds in a given sample type.

Statistical analyses, such as principal component analysis (PCA), are commonly used in biological applications for determining chemical differences between two or more different sample groups, such as control versus diseased states.⁹,¹⁰ This type of approach can be challenging when analyzing food matrices due to inherent sample complexity and diversity. There have been a few reports of statistical classification of food samples using LC/MS data,¹¹ which include applications in adulteration,¹² classifications based on region or type,¹³,¹⁴ and determining the presence of contaminants.¹⁵ Developing these types of data processing approaches will be beneficial in high-throughput screening of foods and other complex sample matrices.

The goal of this work is to determine necessary requirements for a data analysis workflow to distinguish molecular differences present in high, medium, and/or low abundance between control and suspect food samples. Compounds covering a large concentration range need to be parsed from the data to identify both highly abundant adulterants and low-level contaminants. For example, how do chemical complexity and lot-to-lot differences affect differentiation? What kinds of quality controls need to be implemented in this type of workflow? Experimental design and data processing factors that impact final results are determined, and suggestions are provided for successful method development for statistically analyzing suspect samples, with an emphasis on identifying adulterants and contaminants for food safety.

Materials and Methods

Sample Preparation

Three lots of milk-based infant formula (I.F.) and seven lots of pulp-free 100% orange juice (O.J.) were purchased from local grocery stores, each within the same brand of product but with different lot numbers. Five samples were prepared for each lot; 2 g of powdered formula and 2 mL of O.J. were extracted in 10 mL acetonitrile in 15 mL polypropylene centrifuge tubes. These were rotated on a roller mixer (Stuart, Bibby Scientific, Staffordshire, UK) for 1 h at 33 rpm, centrifuged at 3900 rcf for 10 min, and filtered with 0.2 µm PTFE luer lock syringe filters (Grace, Deerfield, IL, USA).

A 1 ng/µL mixture of colchicine, hydrocodone, ricinine, and yohimbine (Table S1; Cerilliant, Round Rock, TX, USA) was prepared in 90/10 (v/v) water/acetonitrile. This analytical standard mixture was spiked into replicate sample extracts from one of the lots for each sample matrix (lots 1 and 6 for I.F. and O.J., respectively) for final concentrations of 10, 100, and 500 ppb (i.e., 5 replicates for each concentration). These represent the suspect sample groupings; samples from lots 1 and 6 were also prepared without the addition of the standard mixture, referred to as the unspiked matrix group. The control group of samples included lots 2 and 3 for I.F. and lots 1-5 and 7 for O.J. The different comparisons of sample groups are listed in Table S2, where the control group is compared against each spiked matrix concentration level, the combined concentration levels, and the unspiked matrix group. A 50 ppb standard mixture was also prepared as a quality control (QC) sample, as was a blank (water injection), both of which were analyzed periodically throughout the sample sequence. All solvents used were Optima Grade (Thermo Fisher Scientific, Pittsburgh, PA, USA).

Instrumentation

A 1290 Infinity LC was used in combination with a 6550 Q-TOF (Agilent Technologies, Santa Clara, CA). A dual electrospray ionization (ESI) source was used with the following parameters: 150 °C drying gas temperature, 19 L/min gas flow, 35 psig nebulizer, 350 °C sheath gas temperature, and 12 L/min sheath gas flow. The ESI voltage was 3.5 kV, the nozzle voltage was 0.5 kV, and the instrument operated in positive ion MS scan mode and monitored m/z 100-1000. Reference masses were used to internally calibrate the data and included protonated purine (m/z 121.0509) and protonated hexakis(1H,1H,3H-tetrafluoropropoxy)phosphazine (m/z 922.0098). The column was a Zorbax Eclipse Plus C18, 2.1 × 150 mm, 1.8 µm (Agilent Technologies). LC conditions included a sample injection volume of 5 µL, 30 °C column temperature, and 0.5 mL/min LC solvent flow rate with 0.1% formic acid (v/v) in water and acetonitrile, A and B, respectively. The gradient was a 3 min hold at 95% A, 17 min linear gradient to 10% A, 5 min hold at 10% A, and 5 min hold at 100% B.
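The reference m/z quoted above for protonated purine can be cross-checked from monoisotopic atomic masses. The following Python sketch is illustrative only: the elemental composition C5H4N4 for purine and the helper name `protonated_mz` are assumptions introduced here, not part of the described method.

```python
# Monoisotopic masses (u) of the most abundant isotopes (standard reference values).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON = 1.00727647  # mass of a proton (u)


def protonated_mz(formula):
    """Return the m/z of the singly charged [M+H]+ ion for a dict of element counts."""
    neutral = sum(MONOISOTOPIC[element] * count for element, count in formula.items())
    return neutral + PROTON


purine = {"C": 5, "H": 4, "N": 4}  # assumed composition of purine
purine_mz = protonated_mz(purine)
print(f"{purine_mz:.4f}")
```

The same helper could be applied to any other lock mass or spiked analyte whose elemental composition is known.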

Data Analysis

The developed data analysis workflow is displayed in Figure 1. MassHunter Qualitative Analysis (Agilent, Version B.07.00) was used for the initial determination and interpretation of eluting compounds by using “Find by Molecular Feature”. The data analysis method included: 500 counts minimum ion peak height, compound filtering removed entities [...]

[...] 4 out of 5 (80%) of each of the spiked matrix concentrations (outlined in green in Table S4). The quality score is calculated by an algorithm that considers the signal to noise, retention time consistency, peak shape and width, isotopic pattern, and mass difference between ions and their specified adducts. The data analysis method was also examined for the O.J. QC samples, where the “no peak limit” method was sufficient for extracting the analytes from the QC and spiked matrix samples. This emphasizes the importance of data analysis method testing and optimization for each data set and incorporating appropriate QC samples within the acquisition sequence.

We suggest analyzing not only QC replicates within a sample sequence, but also the sample matrix of interest spiked with the QC mixture at low, medium, and high concentrations; the analytical standard mixture should contain compounds within the retention time and mass range of interest. This will be an appropriate data set to optimize or test data processing methods and will ensure their suitability for the collected data. Given the large number of molecular features that are detected in food matrices, it would be beneficial to increase the number of compounds in the QC standard mixture to ensure sufficient performance of the feature extraction method. Incorporating appropriate QC samples is also crucial in a statistical analysis workflow to minimize the effects of any experimental or instrumental variability that may be present due to performance differences over time or a change in LC solvents. For example, if the chemical background measured by the instrument changes, these molecular differences may erroneously influence sample type differentiation. Changes in sensitivity and mass accuracy can also be monitored with a QC sample and can indicate if different data processing settings are necessary. Similarly, data processing methods may also need to be modified for different sample matrices; this was the case for the I.F. and O.J. samples analyzed in this study. Furthermore, analyzing replicates (n>5/condition) yields higher confidence in the capabilities of both the instrument detection and data analysis processes because sample-to-sample variability can be monitored.

The extracted molecular features for I.F. and O.J. are displayed in Figure S1. While both food matrices are chemically complex, the O.J. contained more ions of high abundance than the I.F. The numbers of molecular features found in the control lots of I.F. and O.J. were approximately 1200 and 10600, respectively. Manually mining this data for chemical differences between sample groupings would be incredibly difficult and time consuming, as would identifying every eluting compound in either of these individual sample types.

Data Filtering

Filtering the data is feasible when multiple replicates have been analyzed for each sample. It also becomes necessary due to the inherent capabilities of molecular feature extraction and food matrix complexity. For example, searching for all of the detectable compounds in a given sample results in the extraction of many features which are present in only one sample. In the example shown in Figure S2, over 2000 features were detected in only a single O.J. sample among all of the replicates and lots analyzed. Thus, requiring a feature to be present in at least 2 out of the total number of samples substantially reduces the number of features that will be considered. In the all spike and control O.J. comparison, the total number of features found in the sample groupings was reduced from 11516 to 9346.

Because the optimized data analysis method was successful in reproducibly extracting the same features in 80% of the QC and spiked matrices (Table S4), features were required to be present in at least 60% of samples within a given sample set to further filter the data. This more conservative setting was used to accommodate any instances of inconsistent feature extraction, which may be due to the software algorithm or sample matrix variability. In the spiked (all concentrations) and control O.J. comparison, this reduced the number of features by nearly 4600. This filtering ensures that only reproducibly detected and extracted features are retained for subsequent analysis.
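The two frequency filters described above can be sketched in Python as follows. This is a minimal illustration, not the vendor software's implementation: the function name, the data layout, and the reading of the 60% rule as "at least 60% of the samples in at least one group" are assumptions, and the feature and sample names are hypothetical.

```python
def frequency_filter(presence, groups, min_total=2, min_group_fraction=0.6):
    """Keep features seen in enough samples overall and within at least one group.

    presence: dict mapping feature name -> set of sample ids where it was detected
    groups:   dict mapping group name -> list of sample ids in that group
    """
    kept = []
    for feature, samples in presence.items():
        # First filter: drop features detected in fewer than `min_total` samples.
        if len(samples) < min_total:
            continue
        # Second filter (assumed interpretation): the feature must be present in
        # at least `min_group_fraction` of the samples of some group.
        for members in groups.values():
            found = sum(1 for sample in members if sample in samples)
            if found / len(members) >= min_group_fraction:
                kept.append(feature)
                break
    return kept


groups = {
    "control": ["c1", "c2", "c3", "c4", "c5"],
    "spiked": ["s1", "s2", "s3", "s4", "s5"],
}
presence = {
    "A": {"s1", "s2", "s3"},            # 3 of 5 spiked samples
    "B": {"c1"},                        # a single sample only
    "C": {"c1", "s1"},                  # two samples, but 1 of 5 in each group
    "D": set(groups["control"] + groups["spiked"]),  # detected everywhere
}
kept = frequency_filter(presence, groups)
print(kept)
```

In this toy input, features B and C are discarded by the first and second filter, respectively, which mirrors how sporadically extracted features would be removed before the statistical comparison.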

Visualizing Molecular Differences Between Suspect and Control Groups and Between Sample Lots

After data filtering, the remaining features were analyzed by PCA; the plots are displayed in Figure 2. PCA enables visualization of molecular similarities and differences based on the proximity of samples to one another. For both I.F. and O.J., differentiation of the combined spiked matrices and the control can be observed. It was expected that the lowest concentration of the spiked analytical standard, 10 ppb, would be the most difficult to distinguish from the control. The 10 ppb spiked I.F. and control groups are clearly distinguished from one another; however, this is not obvious in the 10 ppb versus control O.J. comparison.
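To illustrate how PCA places similar samples close together, the sketch below projects a small, made-up abundance table onto its first principal component using power iteration. It is a generic, dependency-free illustration and not the implementation used by the software in this study; the function name, the absence of any scaling or log transformation, and all abundance values are assumptions.

```python
import math


def first_pc_scores(rows, iterations=200):
    """Project samples (rows of feature abundances) onto the first principal component."""
    n, m = len(rows), len(rows[0])
    # Mean-center each feature (column).
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    # Power iteration on X^T X, starting from a uniform direction.
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(m)) for c in centered]


# Hypothetical abundances: two matrix features plus one feature present only after spiking.
control = [[100.0, 50.0, 0.0], [102.0, 49.0, 0.0], [98.0, 51.0, 0.0]]
spiked = [[101.0, 50.0, 20.0], [99.0, 49.0, 22.0], [100.0, 51.0, 21.0]]
scores = first_pc_scores(control + spiked)
print([round(s, 2) for s in scores])
```

With these values the first component is dominated by the spiked-only feature, so the control and spiked samples fall at opposite ends of the score axis; real data would typically be transformed and scaled before this step.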

We also investigated the inherent molecular differences between lots within the same brand of food matrix. Lot 1 and lot 6 from the I.F. and O.J. samples, respectively, were the lots spiked with the analytical standard mixture. These lots were also analyzed without the addition of the standard. By comparing these unadulterated lots to their respective control groups, any inherent molecular differences of that lot can be determined; these features would additionally contribute to any observed chemical differences between the spiked and control comparisons. Interestingly, lot 1 (I.F.) versus the control group does show some separation in the PCA plot; however, when the three lots are compared independently, this differentiation is not observed. It is expected that I.F. should be fairly consistent between different lots within the same brand. In contrast, lot 6 (O.J.) and the control group do not clearly differentiate from one another, similar to the 10 ppb and control comparison for O.J. However, in the individual lot comparison of O.J., lots 7 and 6 cluster separately from the others, where lot 6 is grouped between lot 7 and the remaining lots, indicating that it shares qualities of both sample groupings. This is likely why the 10 ppb samples do not appear to be in their own grouping in that comparison. Notably, the O.J. lots display distinguishable lot-to-lot variability, despite being purchased from the same brand and on the same date. This is not surprising given that the molecular content may be influenced by weather conditions, species of orange, and/or growth location, amongst other variables. Analyzing data in this manner can also serve to identify any samples or sample lots that may be outliers, which can be excluded in further analyses.

Statistical Analysis: T-Test and Fold Change

An unpaired t-test was performed for each of the molecular features after data filtering. Multiple comparisons of the control (unadulterated food matrix extracts) and suspect groups (lots with analytical standard) included the combined spiked matrices and the 10, 100, and 500 ppb spikes individually (listed in Table S2). Additionally, unadulterated lot 1 and lot 6, from I.F. and O.J., respectively, were compared against their matrix control groups. As expected, the number of molecular features found to be statistically different between sample groupings decreases with a decrease in p-value threshold (FDR level), which is illustrated in Figure S3A. However, if a large number of features are found to be statistically significant between sample groupings, a certain number will be expected by chance; this is reflected in Figure S3B. For example, because a large number (>450) of features was found to be statistically different between the 500 ppb and control O.J. groups, approximately 25 of these features would be expected by chance with an adjusted p-value of 0.05. To reduce the number of potential molecular features, the p-value threshold for each comparison was chosen where zero features are expected by chance, which is automatically calculated within MPP (Table S3). Alternatively, a smaller p-value threshold (e.g., 0.001) could be consistently chosen when a large number of distinguishing features are found, but at the risk of removing compounds of concern from the data set. Here, a judgment call needs to be made to determine an appropriate threshold based on the number of potential molecular feature candidates that differentiate sample groupings.
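The relationship between the threshold and the number of features expected by chance can be illustrated with a short Python sketch. It assumes a Benjamini-Hochberg style false discovery rate correction and estimates the chance count as the threshold multiplied by the number of features passing it; the text does not state which correction or formula the software applies, so both are assumptions, as are the function names and the example count of 500 features.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def expected_by_chance(n_passing, fdr_threshold):
    """Assumed estimate: share of the passing features expected to be false positives."""
    return n_passing * fdr_threshold


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
# Hypothetical comparison in which 500 features pass an adjusted threshold of 0.05.
chance_005 = expected_by_chance(500, 0.05)
chance_0001 = expected_by_chance(500, 0.001)
print(adjusted, chance_005, chance_0001)
```

Under these assumptions, tightening the threshold from 0.05 to 0.001 lowers the estimated chance count from roughly 25 features to less than one, which is the trade-off the paragraph above describes.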

Molecular features were then limited to ones that exhibited at least a 2-fold increase compared to the control group. The total number of features shown to initially distinguish the spike and control groups is displayed in Figure 3 (indicated in blue). This specific compound list was then searched again in the raw data files. Finding the presence of specific compounds is generally more definitive than the initial feature extraction because it searches the same m/z value within a specified retention time window for all data files rather than generically searching for eluting compounds. Because of this, the repeat filtering step was chosen to be more restrictive than the initial filtering: features needed to be observed in 80% of the samples in a sample grouping to be included in the subsequent statistical comparison.
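A fold-change filter of the kind described above can be sketched as follows. This is a hedged illustration: comparing group means of raw abundances and treating a feature that is absent from the control as passing are assumptions made here, and the function name and abundance values are hypothetical.

```python
def fold_change_up(suspect, control, min_fold=2.0):
    """Return True when the mean suspect abundance is at least min_fold times the control mean."""
    mean_suspect = sum(suspect) / len(suspect)
    mean_control = sum(control) / len(control)
    if mean_control == 0:
        # Assumed handling: a feature absent from the control passes if it is present at all.
        return mean_suspect > 0
    return mean_suspect / mean_control >= min_fold


# Hypothetical abundances for three features across replicate injections.
print(fold_change_up([400.0, 420.0, 380.0], [100.0, 110.0, 90.0]))  # 4-fold higher
print(fold_change_up([150.0, 160.0, 140.0], [100.0, 110.0, 90.0]))  # 1.5-fold higher
print(fold_change_up([50.0, 60.0], [0.0, 0.0]))                     # absent in control
```

Only increases are retained because a spiked contaminant or adulterant is expected to be more abundant in the suspect group than in the control.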

The statistical analysis was then repeated on the extracted molecular features using this specified search. The number of features that distinguish sample groupings decreases after the repeat analysis, demonstrating that this recursive data treatment was necessary (indicated by red, Figure 3). This was likely due to features that were not reproducibly detected and/or extracted from the data set. An automated recursion process, which some vendors are beginning to develop (e.g., Agilent’s Profinder), would remove some of these manual processing steps. In all comparisons of the control and standard spike groupings, including the 10 ppb and control comparison, the four compounds in the analytical standard mixture could be parsed from the data, which demonstrates the utility of this developed data analysis workflow.

Comparing the spiked matrices to a control group ensured the statistical data analysis methodology was sufficient for different concentrations of compounds present in two very different sample matrices. If spiked matrices are incorporated as a QC step, this data can also be used to test and optimize the statistical data analysis workflow. Different processing parameters or statistical tests may need to be used with different sample matrices or when greater molecular differences are expected due to matrix complexity or variability within a matrix type.

In the I.F. comparisons, most of the t-test sets resulted in approximately 20 molecular features, which is a manageable number to attempt to identify (indicated in red, Figure 3). However, in the 500 ppb I.F. comparison, more than 60 distinguishing compounds were found. Limiting the number of compounds of interest is essential because identification for even a single compound can be lengthy, but this must be done without removing potentially harmful compounds from the data set. In the lot 1 versus control I.F. comparison, fewer than 10 compounds were found to distinguish these sample groups, which means it is likely that many of the features detected in the other comparisons are associated with the standard mixture. This is also indicated by an increase in the number of features with the amount of standard added (Figure 3).

This same trend can be observed in the O.J. sample comparisons, where 100 compounds are found to distinguish the 500 ppb level from the control (indicated in red, Figure 3). The resultant feature lists for the 500 ppb spike and control group comparisons for both I.F. and O.J. were evaluated against features found in the blanks and in their respective unspiked lots (lots 1 and 6 for I.F. and O.J., respectively). None of the molecular features were found in the blank. There was no overlap for I.F., while 3 of the 99 features were intrinsic to lot 6 (Figure S4). Therefore, the majority of features are associated with the addition of the standard mixture. Thus, related molecular features can be removed if they coelute with another compound that exhibited greater ion intensity. An example of this process is shown in Table S5. Compounds within a 0.02 min retention time window were manually binned together, with the most abundant compound being submitted for further analysis. As indicated in green in Figure 3, this decreases the number of molecular features by more than half. While this optional strategy may eliminate features of interest, it was employed here to reduce analysis time.
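The manual binning step above could be automated along the following lines. This Python sketch is one possible reading of the procedure: anchoring each bin at its earliest-eluting member is an assumption, and the function name, feature names, retention times, and abundances are hypothetical.

```python
def bin_coeluting(features, window=0.02):
    """Keep the most abundant feature from each retention time bin.

    features: list of (name, retention_time_min, abundance) tuples
    window:   maximum retention time spread (min) within one bin
    """
    kept = []
    current = []
    for feature in sorted(features, key=lambda f: f[1]):
        # Start a new bin once a feature elutes more than `window` after the bin's first member.
        if current and feature[1] - current[0][1] > window:
            kept.append(max(current, key=lambda f: f[2]))
            current = []
        current.append(feature)
    if current:
        kept.append(max(current, key=lambda f: f[2]))
    return kept


features = [
    ("ion_a", 5.100, 1200.0),
    ("ion_b", 5.110, 98000.0),
    ("ion_c", 5.115, 3500.0),
    ("ion_d", 7.300, 15000.0),
    ("ion_e", 7.450, 800.0),
]
representatives = bin_coeluting(features)
print([name for name, _, _ in representatives])
```

Here the three ions eluting near 5.1 min collapse to the most abundant one, while the two later ions remain separate; as noted above, a genuinely distinct coeluting compound of lower abundance would be lost by this shortcut.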

Molecular Formulae Generation and Database Searching

343

Molecular formulae were generated for statistically different compounds between the control and

344

suspect comparisons. The correct molecular formulae were generated for the compounds in the

345

standard mixture and many of these formulae were the top scoring result; however, there were

346

examples that were not. In the 10, 100, and 500 ppb comparisons in I.F., the molecular formula

347

of colchicine was not the top formula match for that detected compound. In the 500 ppb comparison, three molecular formulae were generated, each with a score greater than 97 on a scale of 100. The molecular formula score evaluates metrics including the signal-to-noise ratio, retention time, chromatographic peak width and shape, isotope pattern, and the mass difference among related ions such as dimers, trimers, and other adducts. The score reflects the probability that the feature is a real compound, with a score of 100 being a perfect fit. It is not obvious why the correct formula was not the top hit, because the measured mass accuracy for the colchicine molecular formula is actually better than that for the top-scoring formula. The relative isotopic distributions for both molecular formulae are also similar. For the O.J. comparisons, an incorrect molecular formula was generated as the top match instead of the formula for ricinine in the all-spike, 100 ppb, and 500 ppb comparisons. In the 500 ppb O.J. comparison, for example, the score for the generated molecular formula of ricinine was 97.6, but the mass accuracy error compared against this compound was worse than the value reported for the top matching molecular formula (3.3 ppm). This example emphasizes the need for good data quality; high mass accuracy and accurately measured isotopic ratios can aid in generating the correct molecular formula.² Furthermore, q-TOF instruments can be susceptible to higher mass accuracy error at high peak abundance due to saturation;¹⁸ however, the m/z value reported was taken from the peak shoulders, below the 30% saturation level (MassHunter indicates saturated compounds), to minimize this effect. Because the best generated molecular formula may not correspond to the detected compound, multiple molecular formulae may need to be considered, where a cutoff score could be implemented.
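
As an illustration of the cutoff idea, the following Python sketch computes the mass accuracy error in ppm for each candidate formula and retains every candidate whose score meets a threshold, rather than only the top hit. The class `FormulaCandidate`, the function names, the default cutoff of 97, and the example values are assumptions made for this illustration; they are not taken from MassHunter or from the data in this work.

```python
from dataclasses import dataclass


@dataclass
class FormulaCandidate:
    formula: str           # candidate molecular formula (placeholder label here)
    score: float           # formula score on a 0-100 scale
    theoretical_mz: float  # m/z calculated for the candidate ion


def ppm_error(measured_mz: float, theoretical_mz: float) -> float:
    """Signed mass accuracy error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def shortlist(candidates, measured_mz, min_score=97.0):
    """Return (candidate, ppm error) pairs that meet the score cutoff.

    All candidates at or above min_score are kept, highest score first,
    so a correct formula that is not the top hit is still reviewed.
    """
    kept = [c for c in candidates if c.score >= min_score]
    kept.sort(key=lambda c: c.score, reverse=True)
    return [(c, ppm_error(measured_mz, c.theoretical_mz)) for c in kept]


# Hypothetical candidates for one detected feature (illustrative values only).
candidates = [
    FormulaCandidate("candidate A", 99.1, 165.0650),
    FormulaCandidate("candidate B", 97.6, 165.0659),
    FormulaCandidate("candidate C", 88.0, 165.0702),
]
for cand, err in shortlist(candidates, measured_mz=165.0662):
    print(f"{cand.formula}: score {cand.score:.1f}, error {err:+.1f} ppm")
```

In this hypothetical case the second-ranked candidate has the smaller mass error, which is the kind of situation in which reviewing all formulae above the cutoff, instead of the top match alone, would be useful.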

There are multiple small-molecule compound databases that are continuously being created or updated. These databases are often not all-encompassing, so multiple databases should be searched. Some vendor software can automatically search against certain molecular databases, but the capability to automatically link to multiple databases within the chosen analysis software, rather than manually searching available databases, would be beneficial. The reduced molecular formulae list generated by comparing the 10 ppb and control O.J. groups was manually searched against four commonly used online databases (Table 1). The number of compounds that match a given molecular formula varies widely, which emphasizes the importance of a manageable number of compounds to identify. If the suspect compound has been studied previously, prioritizing the candidate compounds by the number of references associated with each can be useful and can be accomplished within ChemSpider.¹⁹ However, if the compound is an emerging risk, it may not be well characterized. SciFinder also allows for prioritizing compounds, where the list of compounds can be reduced by a variety of properties, including toxicity. This could be particularly useful if the molecular species is known to cause an illness. However, if the compound is truly an unknown, searching against a molecular database will be insufficient for identification, and MS/MS approaches will be required for structure elucidation. Likewise, MS/MS analysis would be necessary to confirm a database assignment and to aid in identification when no molecular formula is generated, as is the case for two compounds in Table S5.
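
The prioritization described above can be sketched in Python as a merge of hits from several databases followed by ranking on reference count. This is a minimal sketch that operates on locally stored records only; the `DatabaseHit` structure, the `prioritize` function, and all example entries are hypothetical and do not represent the interface or contents of ChemSpider, SciFinder, PubChem, or Metlin.

```python
from typing import NamedTuple


class DatabaseHit(NamedTuple):
    database: str      # source database label
    compound: str      # compound name as listed in that database
    formula: str       # molecular formula of the compound
    n_references: int  # number of associated literature references


def prioritize(hits, formula):
    """Merge hits for one formula across databases and rank by references.

    A compound reported by several databases is kept once, with the
    highest reference count seen for it.
    """
    best = {}
    for hit in hits:
        if hit.formula != formula:
            continue
        current = best.get(hit.compound)
        if current is None or hit.n_references > current.n_references:
            best[hit.compound] = hit
    return sorted(best.values(), key=lambda h: h.n_references, reverse=True)


# Hypothetical records exported from two databases (placeholder names).
hits = [
    DatabaseHit("database 1", "compound X", "C8H8N2O2", 12),
    DatabaseHit("database 2", "compound X", "C8H8N2O2", 30),
    DatabaseHit("database 1", "compound Y", "C8H8N2O2", 45),
    DatabaseHit("database 1", "compound Z", "C22H25NO6", 5),
]
for hit in prioritize(hits, "C8H8N2O2"):
    print(hit.compound, hit.n_references)
```

A ranking of this kind favors well-studied compounds, so it would not help for an emerging risk that has few or no associated references.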

Searching the Metlin database revealed two metabolites of colchicine (gloriosine and desmethylcolchicine) that match molecular formula C21H23NO6 (m/z 386) listed in Table 1; these two compounds were also present as impurities in the standard solution. One of the compounds matching this molecular formula was removed from the list in Table S5 because it coeluted with colchicine. The MS/MS fragmentation listed in the Metlin database indicates that this ion is not a product ion of colchicine, which implies that the similar chemical structure of this metabolite causes its coelution. Furthermore, extracted ion chromatograms of colchicine (m/z 400) and its putative metabolite (m/z 386) do not completely overlay, which further supports that m/z 386 is not a product ion of colchicine (Figure S5). The two metabolites of colchicine were found in all of the comparisons for both I.F. and O.J., which is additional confirmation of the data analysis workflow functionality.
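
The relationship between a molecular formula and the m/z of its ion can be checked with a short calculation. The Python sketch below computes monoisotopic masses for C22H25NO6 (colchicine) and C21H23NO6, assuming singly protonated [M+H]+ ions; the ion type, the function names, and the simple formula parser (no brackets or charges) are assumptions made for this illustration.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720707,
}
PROTON_MASS = 1.00727646  # mass of a proton (u)


def parse_formula(formula: str) -> dict:
    """Count atoms in a simple formula such as 'C21H23NO6' (no brackets)."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula: str) -> float:
    """Neutral monoisotopic mass of the formula."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in parse_formula(formula).items())


def protonated_mz(formula: str) -> float:
    """m/z of the singly charged [M+H]+ ion."""
    return monoisotopic_mass(formula) + PROTON_MASS


for name, formula in [("colchicine", "C22H25NO6"), ("putative metabolite", "C21H23NO6")]:
    print(f"{name}: M = {monoisotopic_mass(formula):.4f}, "
          f"[M+H]+ = {protonated_mz(formula):.4f}")
```

Under these assumptions, the calculated [M+H]+ values are approximately 400.18 and 386.16. They differ by about 14.016, the mass of CH2, which is consistent with a desmethyl metabolite.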

Conclusions

Validating method performance is critical in ensuring that hazardous compounds will be parsed from the data set. As mentioned previously, analyzing both QC standards and QC matrix samples is beneficial in determining instrumental platform and data analysis performance. Data processing settings are not universal for all matrices and data sets. QC matrix samples provide the ability to adjust those settings, including those for feature extraction, so that true unknowns have the highest probability of being determined in complex food samples. Additionally, QC samples can be used to determine whether the chromatographic method resolves eluting peaks and results in consistent retention times and mass accuracy, which will enable identical compounds detected in multiple analyses to be properly binned. Irregular chromatographic peak shapes will also make feature comparison more challenging. If coelution or poor peak shapes affect the feature extraction of QC compounds, the chromatography should be optimized. It is also worth noting that sufficient sample preparation, in addition to chromatography, is required to ensure that compounds are detected from samples of interest, although this was outside the scope of this work.

Analyzing blanks within the sample sequence is also necessary. This was not critical in the present study because the samples were analyzed by the instrument within the same week. If collected data need to be compared with data analyzed in previous months or years, the chemical background of the instrument platform may not be identical; this can also occur on a much smaller timescale (a few days) and may lead to false positives. The collected data from blank injections can be used to ensure that differentiating compounds are not from the chemical background of the system. Similarly, analyzing the samples in random order in the acquisition sequence will reduce this potential source of error. Incorporating an internal standard into each of the samples can also serve as a normalization factor to account for any instrumental performance differences.
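
The three safeguards described in this paragraph (randomized acquisition order, blank comparison, and internal standard normalization) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: feature abundances are held in dictionaries keyed by a feature label, the function names are hypothetical, and the threefold blank threshold is an illustrative choice rather than a value used in this work.

```python
import random


def randomized_sequence(sample_names, seed=0):
    """Return the injection order shuffled with a reproducible seed."""
    order = list(sample_names)
    random.Random(seed).shuffle(order)
    return order


def normalize_to_internal_standard(abundances, internal_standard):
    """Divide every feature abundance by that of the internal standard."""
    reference = abundances[internal_standard]
    if reference <= 0:
        raise ValueError("internal standard was not detected")
    return {feature: value / reference for feature, value in abundances.items()}


def remove_background(sample, blank, min_fold=3.0):
    """Keep features whose abundance exceeds min_fold times the blank signal."""
    return {
        feature: value
        for feature, value in sample.items()
        if value > min_fold * blank.get(feature, 0.0)
    }


# Hypothetical abundances for one sample and one blank injection.
sample = {"internal standard": 200.0, "feature 1": 100.0, "feature 2": 50.0}
blank = {"feature 2": 40.0}

print(randomized_sequence(["control 1", "control 2", "spiked 1", "spiked 2"]))
filtered = remove_background(sample, blank)
print(normalize_to_internal_standard(filtered, "internal standard"))
```

In this example, "feature 2" is discarded because its abundance is not sufficiently above the blank, and the remaining abundances are expressed relative to the internal standard so that runs with different instrument response can be compared.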

Sample matrices where low lot-to-lot variability or a lower number of molecular features are expected will be easier to compare, especially between months, and even years. Despite the O.J. being purchased on the same date and from the same brand, some of the lots were different from the others (Figure 2). This emphasizes the need to analyze a sufficiently large number of replicates and lots for a particular sample type to ensure a representative sample grouping. It is expected that the molecular profile of O.J. could vary depending on the brand and blend, where the oranges are grown, the types of oranges used, and the climate. If there are inherent differences among the samples being analyzed, there will be a larger number of statistically relevant molecular features in addition to any adulterants/contaminants present, so additionally identifying these compounds will increase analysis time. However, this is still an improvement compared to identifying all compounds within a given sample. For sample types where larger intralot variability is expected, a database of compounds that are common could be generated, which could also reduce analysis time. Of course, these samples would need to be void of any potential hazards.
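
One way such a database of common compounds could be applied is to discard differentiating features that match a library entry in both m/z and retention time. The Python sketch below illustrates this; the representation of a feature as an (m/z, retention time) pair, the function names, and the tolerances of 10 ppm and 0.2 min are assumptions made for this illustration and are not settings from this work.

```python
def matches_library(feature, library, ppm_tol=10.0, rt_tol=0.2):
    """True if the feature matches any library entry in m/z and retention time."""
    mz, rt = feature
    return any(
        abs(mz - lib_mz) / lib_mz * 1e6 <= ppm_tol and abs(rt - lib_rt) <= rt_tol
        for lib_mz, lib_rt in library
    )


def unexplained_features(features, library, ppm_tol=10.0, rt_tol=0.2):
    """Return the features that are not accounted for by the library."""
    return [
        f for f in features
        if not matches_library(f, library, ppm_tol, rt_tol)
    ]


# Hypothetical library of compounds common to the matrix: (m/z, retention time in min).
common_compounds = [(165.0659, 3.10), (273.0757, 7.45)]

# Hypothetical differentiating features from a suspect sample.
features = [(165.0660, 3.15), (400.1755, 9.80)]

print(unexplained_features(features, common_compounds))
```

Only the features left after this filter would then require formula generation and database searching, which is where the reduction in analysis time would come from.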

While further advancement is needed to improve high-throughput identification, current software tools are sufficient to detect molecular differences in food matrices, in spite of chemical complexity. Statistical comparisons can be successful if appropriate quality controls are implemented and if adequate sampling accounts for potential molecular variation within the same sample type. The data analysis workflow developed here can be used as a model for statistical elucidation of compounds present in suspect food samples.

Abbreviations Used

LC/MS, liquid chromatography coupled to mass spectrometry
HR-MS, high-resolution mass spectrometry
MS/MS, tandem mass spectrometry
I.F., infant formula
O.J., orange juice
QC, quality control
MPP, Mass Profiler Professional
PCA, principal component analysis

Acknowledgments

The authors would like to thank John Ihrie (FDA) for insightful discussions concerning the statistical treatment of the data.

Supporting Information

Supporting figures (Figures S1-S5) and tables (Tables S1-S5) as noted in the text.

References

(1) Knolhoff, A. M.; Callahan, J. H.; Croley, T. R. J. Am. Soc. Mass Spectrom. 2014, 25, 1285-1294.
(2) Kind, T.; Fiehn, O. BMC Bioinformatics 2007, 8, 105.
(3) Castro-Puyana, M.; Herrero, M. TrAC-Trend. Anal. Chem. 2013, 52, 74-87.
(4) García-Cañas, V.; Simó, C.; Herrero, M.; Ibáñez, E.; Cifuentes, A. Anal. Chem. 2012.
(5) Herrero, M.; Simó, C.; García-Cañas, V.; Ibáñez, E.; Cifuentes, A. Mass Spectrom. Rev. 2012, 31, 49-69.
(6) Hu, C.; Xu, G. TrAC-Trend. Anal. Chem. 2013, 52, 36-46.
(7) Scalbert, A.; Andres-Lacueva, C.; Arita, M.; Kroon, P.; Manach, C.; Urpi-Sarda, M.; Wishart, D. J. Agr. Food Chem. 2011, 59, 4331-4348.
(8) Moore, J. C.; Spink, J.; Lipp, M. J. Food Sci. 2012, 77, R118-R126.
(9) Pan, Z.; Gu, H.; Talaty, N.; Chen, H.; Shanaiah, N.; Hainline, B.; Cooks, R. G.; Raftery, D. Anal. Bioanal. Chem. 2007, 387, 539-549.
(10) Wang, C.; Kong, H.; Guan, Y.; Yang, J.; Gu, J.; Yang, S.; Xu, G. Anal. Chem. 2005, 77, 4108-4116.
(11) Knolhoff, A. M.; Croley, T. R. J. Chrom. A, DOI: http://dx.doi.org/10.1016/j.chroma.2015.08.059.
(12) Vaclavik, L.; Schreiber, A.; Lacina, O.; Cajka, T.; Hajslova, J. Metabolomics 2012, 8, 793-803.
(13) Cotton, J.; Leroux, F.; Broudin, S.; Marie, M.; Corman, B.; Tabet, J.-C.; Ducruix, C.; Junot, C. J. Agr. Food Chem. 2014, 62, 11335-11345.
(14) Vaclavik, L.; Lacina, O.; Hajslova, J.; Zweigenbaum, J. Anal. Chim. Acta 2011, 685, 45-51.
(15) Tengstrand, E.; Rosén, J.; Hellenäs, K.-E.; Åberg, K. M. Anal. Bioanal. Chem. 2013, 405, 1237-1243.
(16) Bolton, E.; Wang, Y.; Thiessen, P. A.; Bryant, S. H. In Annu. Rep. Comput. Chem.; Elsevier: Oxford, UK, 2008; pp 217-240.
(17) Smith, C. A.; O'Maille, G.; Want, E. J.; Qin, C.; Trauger, S. A.; Brandon, T. R.; Custodio, D. E.; Abagyan, R.; Siuzdak, G. Ther. Drug Monit. 2005, 27, 747-751.
(18) Bristow, T.; Constantine, J.; Harrison, M.; Cavoit, F. Rapid Commun. Mass Spectrom. 2008, 22, 1213-1222.
(19) Little, J.; Williams, A.; Pshenichnov, A.; Tkachenko, V. J. Am. Soc. Mass Spectrom. 2011, 23, 179-185.

Tables

Table 1. Number of compounds found for generated molecular formulae for differentiating analytes from the 10 ppb spiked and control orange juice groups. The molecular formulae from the standard mixture are outlined in green.

Generated Molecular Formula    ChemSpider    SciFinder    PubChem    Metlin
C8H8N2O2                       515           1414         1          1
C15H30O2S2                     1             36           0          0
C18H21NO3                      3974          9675         15         13
C16H10N7O3                     2             0            0          0
C17H25NO10                     21            129          1          0
C21H26N2O3                     6984          11430        3          9
C29H47N5OS4                    0             0            0          0
C21H23NO6                      1786          2414         0          8
C22H25NO6                      1721          2301         2          3
C22H22O9                       114           360          1          17

Figure Graphics

Figure 1. Data analysis workflow for distinguishing suspect and control groups. A. Overall data processing to aid in identifying unknown compounds using a statistical approach. B. Procedure implemented for data filtering and statistical analysis.

Figure 2. PCA of selected comparisons of sample groupings. “Spiked vs Control” compares the unadulterated lots with all the matrices spiked with different concentrations of analytical standard, while “10 ppb vs Control” includes unadulterated lots and only the 10 ppb spiked matrices. “Lot 1 vs Control” and “Lot 6 vs Control” enabled any molecular differences inherent to Lot 1 or 6 to be monitored without the contribution of the analytical standard. The unadulterated lots are also compared in “Lot Comparison” to observe the molecular differences present between lots of the same brand of sample matrix.

Figure 3. Number of molecular features found to differ between sample groupings for different stages of the data analysis process.

Graphic for Table of Contents
