
Use of Comprehensive Two-Dimensional Gas Chromatography with Time-of-Flight Mass Spectrometric Detection and Random Forest Pattern Recognition Techniques for Classifying Chemical Threat Agents and Detecting Chemical Attribution Signatures

Erich D. Strozier, Douglas D. Mooney, David A. Friedenberg, Theodore P. Klupinski, and Cheryl A. Triplett

Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.6b00725 • Publication Date (Web): 13 Jun 2016


Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Analytical Chemistry

Erich D. Strozier1*, Douglas D. Mooney1,2, David A. Friedenberg1, Theodore P. Klupinski1, and Cheryl A. Triplett1

1. Battelle Memorial Institute, 505 King Avenue, Columbus, Ohio 43201
2. Early Moon, LLC, 1391 West 5th Avenue Suite 423, Columbus OH 43212

* To whom correspondence should be addressed. E-mail: [email protected], Fax: 614-4583546.

ABSTRACT

In this proof-of-concept study, chemical threat agent (CTA) samples were classified to their sources with accuracies of 87% to 100% by applying a random forest statistical pattern recognition technique to analytical data acquired by comprehensive two-dimensional gas chromatography with time-of-flight mass spectrometric detection (GC×GC-TOFMS). Three organophosphate pesticides (chlorpyrifos, dichlorvos, and dicrotophos) were used as the model CTAs, with data collected for 4–6 sources per CTA and 7–10 replicate analyses per source. The analytical data were also evaluated to determine tentatively identified chemical attribution signatures for the CTAs by comparing samples from different sources according to either the presence/absence of peaks or the relative responses of peaks. These results demonstrate that GC×GC-TOFMS analysis in combination with a random forest technique can be useful in sample classification and signature identification for pesticides. Furthermore, the results suggest that this combination of analytical chemistry and statistical approaches can be applied to forensic analysis of other chemicals for similar purposes.

INTRODUCTION

In the event of a criminal or terrorist action involving a chemical threat agent (CTA), identifying the source of the chemical used could lead investigators to the responsible parties. In this paper, we describe a method for classifying the sources of CTAs using a combination of comprehensive two-dimensional gas chromatography with time-of-flight mass spectrometric detection (GC×GC-TOFMS) and a state-of-the-art pattern recognition algorithm called random forest.

Commonly used analysis techniques such as gas chromatography coupled with mass spectral detection can be used to positively identify the parent CTA materials. However, these techniques offer little in the way of providing forensic information, such as the CTA source or manufacturing method. Components other than the active ingredients (e.g., manufacturing precursors or byproducts) are often present in commercial preparations of CTAs and may offer information on the synthesis methods or manufacturing processes used. Unlike the active ingredients, these components are typically not known beforehand. Thus, to reliably classify the source, one must first identify the distinguishing components. If detectable, these components provide “fingerprints” for specific sources of CTAs. Comprehensive two-dimensional gas chromatography with time-of-flight mass spectrometric detection (GC×GC-TOFMS) offers a greater potential for forensic identification of CTAs than traditional analysis methods. The separation of chemical components by two orthogonal properties (e.g., boiling point in the first dimension and polarity in the second dimension) greatly expands the chromatographic space in which compounds can be separated from one another.1 Time-of-flight mass spectrometry provides sensitivity approaching quadrupole selective ion monitoring (SIM) but has the advantage of collecting spectra in full-scan mode, thus allowing for sensitive non-targeted analysis of complex mixtures. Furthermore, the collection of full-scan mass spectra allows for the use of spectral deconvolution software, which can be used to further separate interfered or overlapping component peaks.

A common limitation encountered in GC×GC-TOFMS analysis is sorting through the large amounts of data obtained from a single sample to find meaningful components in the comparison of multiple samples. In some cases, patterns can be identified in the data through visual inspection. Examples include the analysis of weathered oil samples2 or preparations of the rodenticide tetramethylenedisulfotetramine (TETS) from different synthetic routes.3 The use of statistical techniques for data processing has seen much advancement, as described in several recent review articles about GC×GC-TOFMS and other two-dimensional separation techniques.4–9 One common statistical evaluation technique used with GC×GC-TOFMS is principal component analysis (PCA), which has been applied, to cite just a few of many examples, to assess the ages of samples of radix ginseng volatile oil,10 to identify potential biomarkers of perinatal asphyxia in non-human primates,11 to determine chemical signatures of ecstasy manufacturing,12 and to assign the origins of coal tars from manufactured gas plants.12 A second common technique is partial least squares discriminant analysis (PLSDA), which has been applied, for example, in characterizing the particulate matter in smoke from different types of cigarettes13 and identifying compounds in exhaled breath possibly associated with allergic asthma.14 Another common technique is nonnegative matrix factorization (NNMF), which has been used to identify chemical impurity profiles for a chemical weapon precursor.15

PCA, PLSDA, and NNMF all involve reducing the dimension of the data set in order to discover trends among related samples.16,17 Using one of these techniques for the analysis of complex mixtures may unintentionally lead to ignoring the influence of components that may be small but important for classification. Moreover, these techniques can be susceptible to over-fitting in practice. For example, over-fitting was observed to be a problem when using PLSDA to evaluate GC×GC-TOFMS data for a metabolomics study.18 Furthermore, these algorithms are typically outperformed by state-of-the-art pattern recognition algorithms that do not require dimension reduction.19 Some research has been conducted to apply different types of pattern recognition techniques (including those particularly suitable for data sets with very large numbers of variables relative to the number of observations) to evaluate GC×GC-TOFMS data.5,18,20 These possibilities, however, have not been thoroughly explored.

In this proof-of-concept study, we classify CTA samples according to their sources using GC×GC-TOFMS analysis in combination with random forest classification techniques,16,21 which consider all peaks present in the data without the need for dimension reduction. We demonstrate the effectiveness of this approach using standard statistical metrics and also by accurately predicting the sources of specific samples when the analyst was blind to the true source identities. Additionally, the GC×GC-TOFMS data are used to identify potential source-distinguishing components, referred to as chemical attribution signatures (CASs), associated with the selected CTAs.

EXPERIMENTAL SECTION

Test Chemical Selection

Organophosphate pesticides (OPPs) were selected as example CTAs for demonstration of the selected approach for sample classification, as OPPs are a group of highly toxic compounds that may be attractive to terrorist and criminal elements for use as CTAs.22,23 For each of three selected OPPs, four to six neat or formulated sources were acquired from U.S. and non-U.S. manufacturers, as listed in Table 1. To support a robust demonstration for satisfying the primary objective of the study, the sources chosen for a given OPP represented at least three companies and at least two countries of origin. The selected OPPs (dichlorvos, dicrotophos, and chlorpyrifos) represent potential threats for use as CTAs in terms of toxicity and worldwide availability. Dichlorvos and dicrotophos are each classified as restricted use, class 1, or “most toxic” pesticides by the EPA.24 Chlorpyrifos, which is less toxic,25 was selected for study due to its wide use and availability worldwide.

Table 1. List of OPP compounds, including sources and source codes.

| Compound | Source Code | Source, Origin | Formulation | Purity (%) |
|---|---|---|---|---|
| Dichlorvos (CASRN 62-73-7; Oral LD50 Rat = 56 mg/kg) | MXN | Manufacturer X | neat | 99.1 |
| | MX4 | Manufacturer X | emulsifiable concentrate | 41.5 |
| | RN | Ruina International, China | neat | 95 |
| | R5 | Ruina International, China | emulsifiable concentrate | 50 |
| | PsN | Fluka, Germany | neat | 99.9 |
| | SgN | Chem Service, US | neat | 98.8 |
| Dicrotophos (CASRN 141-66-2; Oral LD50 Rat = 22 mg/kg) | MYN | Manufacturer Y | neat | 86.8 |
| | MY8 | Manufacturer Y | water miscible | 85.6 |
| | PsN | Fluka, Germany | neat | 96.7 |
| | SgN | Chem Service, US | neat | 99.5 |
| Chlorpyrifos (CASRN 2921-88-2; Oral LD50 Rat = 135 mg/kg) | DwUSN | Dow Agrosciences, US | neat | 99.7 |
| | DwUKN | Dow Agrosciences, UK | neat | 98 |
| | DwUS4 | Dow Agrosciences, US | emulsifiable concentrate | 45 |
| | DwUK4 | Dow Agrosciences, UK | emulsifiable concentrate | 45 |
| | PsN | Fluka, Germany | neat | 99.9 |
| | SgN | Supelco, US | neat | 99.9 |

CASRN = Chemical Abstracts Service registry number
Oral LD50 Rat = median lethal dose of compounds from oral testing with rats.23


Sample Preparation

Stock solutions were prepared for each OPP source in acetone at nominal concentrations of 1 mg/mL. Ten undiluted aliquots of each stock solution were prepared for GC×GC-TOFMS analysis by adding six isotopically labeled internal standard compounds (acenaphthene-d10, chrysene-d12, 1,4-dichlorobenzene-d4, naphthalene-d8, perylene-d12, and phenanthrene-d10) to each sample at concentrations of 200 ng/mL. The internal standards were used for normalization of sample component responses during the data processing portion of the study. No further dilutions were made, as the intent was to analyze the samples not for quantification of OPPs but for identification of impurities that could be present in the OPP sources at relative concentrations of about 0.01% to 10%. At the same time, ten aliquots of the solvent were prepared using the same internal standards to serve as process blanks, abbreviated as “Bk”. The use of ten replicate aliquots per source is important to provide a data subset that can represent the variability associated with repeatable processes such as sample dilution and instrument analysis.

GC×GC-TOFMS Analysis

Sample analyses were performed using a Leco Pegasus 4D GC×GC-TOFMS (Leco, St Joseph, MI, USA). The Pegasus 4D utilizes an Agilent 6890 gas chromatograph fitted with a cryogenically cooled two-stage modulator and a secondary temperature-programmable oven, both mounted inside the main GC oven. The GC is coupled to the Pegasus III time-of-flight mass spectrometer. Instrument specifications and method conditions are provided in Table S-1.

For each compound, the replicate samples were analyzed by GC×GC-TOFMS on 10 different days. Each day included an injection of one replicate from each of the sources in order to minimize the possible influence of aging effects or day-to-day variability in instrument performance. Randomization was applied to determine the specific samples analyzed on different days and the sample injection sequence on each day. Only seven of the ten chlorpyrifos PsN samples provided usable data due to corrupted instrument raw data files. For each of the other sources, all ten sample aliquots provided usable data.
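The run design described above (one replicate per source per day, with randomized replicate assignment and injection order) can be sketched as follows. This is only an illustration under stated assumptions: the function name `randomized_schedule`, the seed, and the inclusion of the process blank in every daily sequence are hypothetical and are not taken from the study.

```python
import random

def randomized_schedule(sources, n_days, seed=None):
    """Return a list of daily injection sequences.

    Each day contains exactly one replicate from every source.
    Which replicate runs on which day, and the injection order
    within a day, are both randomized.
    """
    rng = random.Random(seed)
    # Randomly assign replicate numbers 1..n_days of each source to days.
    assignment = {}
    for source in sources:
        replicates = list(range(1, n_days + 1))
        rng.shuffle(replicates)
        assignment[source] = replicates
    schedule = []
    for day in range(n_days):
        injections = [(source, assignment[source][day]) for source in sources]
        rng.shuffle(injections)  # randomize the injection sequence for the day
        schedule.append(injections)
    return schedule

# Dichlorvos source codes from Table 1 plus the process blank (assumed to run daily).
sources = ["MX4", "MXN", "Bk", "PsN", "R5", "RN", "SgN"]
schedule = randomized_schedule(sources, n_days=10, seed=1)
```

Each `(source, replicate)` pair appears exactly once across the ten days, so no aliquot is injected twice and no source is missing from any day.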


Data acquisition, pre-pattern-recognition peak-finding, and spectral deconvolution were implemented using all acquired mass spectral data channels (m/z 35–600) and performed using Leco ChromaTOF v3.21 software. Tentative component identifications were performed by automated matching of deconvoluted component mass spectra with the National Institute of Standards and Technology (NIST) 05 Mass Spectral Library. Peak tables were generated showing the retention time, peak height, best mass spectral library match, CASRN (Chemical Abstracts Service registry number) of the best match, and match quality. Representative three-dimensional chromatograms for the test materials are shown in Figures S-2 through S-6.

Data Preprocessing

Prior to application of the pattern recognition techniques, GC×GC-TOFMS sample peak tables were filtered for known analysis system artifacts such as column siloxane bleed and injection solvent. After filtering, sample results included several hundred to several thousand component peaks, depending on the source. These filtered peak tables were used as input to the pattern recognition method development process. Due to the large amount of data acquired, analyst verification of all spectra and identity verification of all retained components were not feasible; however, the compound names and CASRNs are useful tags applied by the software to components, regardless of whether the components are the indicated compounds. Thus, the term “compound,” when used throughout the discussion of the GC×GC-TOFMS results, truly means “tentatively identified compound.”

For each injection, a given single component may be deconvoluted by the data processing software as multiple peaks in retention-time space. It is known that strong peaks are sometimes represented across much of the 2nd-dimension retention-time range (3 seconds in total), while the 1st-dimension retention time is generally accurate to within 6 to 9 seconds. The instrument integration software does not provide a suitable means of summing these multiple peaks. To accommodate this expected analytical variability for a particular compound, we took as a base the retention-time pair corresponding to the largest peak. We then added to the response for this peak the sum of all responses for peaks with the same CASRN found within 6 seconds of the base 1st-dimension retention time and anywhere along the axis for the 2nd-dimension retention time. Thus, responses for all compound-associated peaks that lie within a rectangle 12 seconds wide in the first dimension by 3 seconds wide in the second dimension were summed together. In practice, the distribution of data points within this rectangle often has a roughly oval shape, and the approach is referred to as the “Oval Area” method. The Oval Area method proved to be useful in controlling the data artifacts and gave stronger results than using the maximum peak value or other sums with various thresholds. A heat map plot depicting the oval area for dichlorvos RN is shown in Figure S-7.
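The Oval Area summation described above might be sketched as follows. The peak-table layout (a list of dictionaries with `casrn`, `rt1`, `rt2`, and `response` keys), the function name `oval_area`, and the numeric values are assumptions made for illustration; the actual ChromaTOF export format is not described in the text.

```python
from collections import defaultdict

def oval_area(peaks, window_s=6.0):
    """Sum responses per CASRN around the largest peak of that CASRN.

    peaks: list of dicts with 'casrn', 'rt1' (1st-dimension retention
    time, s), 'rt2' (2nd-dimension retention time, s), and 'response'.
    Peaks within +/- window_s of the base peak in the 1st dimension are
    summed; the 2nd-dimension retention time is deliberately ignored,
    so the full 3 s range is covered.
    """
    by_casrn = defaultdict(list)
    for peak in peaks:
        by_casrn[peak["casrn"]].append(peak)

    areas = {}
    for casrn, group in by_casrn.items():
        base = max(group, key=lambda p: p["response"])  # largest peak is the base
        areas[casrn] = sum(
            p["response"] for p in group
            if abs(p["rt1"] - base["rt1"]) <= window_s
        )
    return areas

# Made-up peak table for one injection (values are illustrative only).
peaks = [
    {"casrn": "62-73-7", "rt1": 600.0, "rt2": 1.2, "response": 1000.0},
    {"casrn": "62-73-7", "rt1": 603.0, "rt2": 2.8, "response": 150.0},
    {"casrn": "62-73-7", "rt1": 594.0, "rt2": 0.4, "response": 50.0},
    {"casrn": "62-73-7", "rt1": 630.0, "rt2": 1.1, "response": 40.0},  # outside the window
    {"casrn": "127-18-4", "rt1": 300.0, "rt2": 0.9, "response": 20.0},
]
areas = oval_area(peaks)
```

In this toy table, the three 62-73-7 peaks within 6 s of the base peak are summed to 1200.0, and the peak 30 s away is excluded.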

Classification Using Random Forest Techniques

The compounds identified in these samples served as variables, which were used in two ways. One set of analyses was performed using values consisting of either “1” or “0” corresponding to whether or not the compound was present in a sample. These variables are termed in this report as “In/Out variables.” A second set of analyses was performed using the magnitudes of the responses for each compound, where the responses were computed using the Oval Area method discussed previously. These variables are called “Oval Area variables.”
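As a minimal sketch of how the two variable sets could be laid out as sample-by-compound matrices, consider the following. The function name `build_variables`, the sample identifiers, and the choice to record an undetected compound as a response of 0 are assumptions for illustration, not details confirmed by the text.

```python
def build_variables(sample_areas):
    """Build In/Out and Oval Area matrices from per-sample responses.

    sample_areas: {sample_id: {casrn: oval_area_response}}
    Returns (samples, compounds, in_out, oval), where rows follow
    `samples` and columns follow `compounds`.
    """
    samples = sorted(sample_areas)
    compounds = sorted({c for areas in sample_areas.values() for c in areas})
    # Assumption: a compound not detected in a sample gets a response of 0.
    oval = [[float(sample_areas[s].get(c, 0.0)) for c in compounds] for s in samples]
    # In/Out variable: 1 if the compound was detected, 0 otherwise.
    in_out = [[1 if value > 0 else 0 for value in row] for row in oval]
    return samples, compounds, in_out, oval

# Hypothetical responses for two samples.
sample_areas = {
    "RN-01": {"127-18-4": 35.0, "76-01-7": 12.0},
    "PsN-01": {"79-11-8": 8.0},
}
samples, compounds, in_out, oval = build_variables(sample_areas)
```

Both matrices share the same row and column order, so either one can be passed to the same classifier.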

Classification Trees

The classification tree is a popular method for classifying observations into groups using a set of predictor variables. An example classification tree for the chlorpyrifos data is presented in Figure 1. In this example, the predictor variables for a sample are the presence or absence of compounds in the sample (i.e., using the In/Out variables). Compounds in the tree are designated by their CASRNs. The initial split for the classification tree in Figure 1 is on the compound 13588-28-8 (2-(2-methoxypropoxy)-1-propanol). This compound is present for all samples on the left branch, corresponding to the DwUK4 source, and absent for all samples on the right branch, corresponding to all other sources. The second split is on the compound 38836-21-4 (4-hydroxy-5-methylhexan-2-one), and the splits continue until all samples have been classified. The end nodes, or leaves of the tree, give the identification of data in that branch to the different sources. The ordering is Bk/DwUK4/DwUKN/DwUS4/DwUSN/PsN/SgN. Thus a leaf with the notation 0/10/0/0/0/0/0 indicates that the 10 samples at the end of that branch are all from the DwUK4 source. A single classification tree will often fail to completely capture the complex structures in data because the tree structure is simple by design. Furthermore, small perturbations in the data can sometimes cause significant changes in the structure of the classification tree. The example classification tree shown in Figure 1 is one of many trees that give perfect separation for the chlorpyrifos data, so the single tree gives useful, but incomplete, information concerning which compounds can distinguish among different sources.

Figure 1. Example classification tree for chlorpyrifos data. Starting at the top: If compound 13588-28-8 is present, classify as DwUK4; otherwise, examine if 38836-21-4 is present. If it is, classify as DwUS4; otherwise, proceed down the tree. The numbers below each leaf show the number of observations in that leaf of each source. The order is Bk/DwUK4/DwUKN/DwUS4/DwUSN/PsN/SgN. For these data, this tree shows one way to perfectly separate the data, although there may be others.
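The first two splits and the leaf notation of Figure 1 can be written out as a short sketch. Only the two splits named in the caption are encoded; the function names are hypothetical, and the deeper splits of the tree are not reproduced because they are not given in the text.

```python
# Source order used in the leaf notation of Figure 1.
SOURCES = ["Bk", "DwUK4", "DwUKN", "DwUS4", "DwUSN", "PsN", "SgN"]

def parse_leaf(notation):
    """Turn a leaf label such as '0/10/0/0/0/0/0' into {source: count}."""
    counts = [int(value) for value in notation.split("/")]
    return dict(zip(SOURCES, counts))

def first_two_splits(present):
    """Apply the first two In/Out splits described in the caption.

    present: set of CASRNs detected in the sample.
    Returns a source code, or None when the sample must continue
    down the tree (later splits are not listed in the text).
    """
    if "13588-28-8" in present:   # 2-(2-methoxypropoxy)-1-propanol
        return "DwUK4"
    if "38836-21-4" in present:   # 4-hydroxy-5-methylhexan-2-one
        return "DwUS4"
    return None

leaf = parse_leaf("0/10/0/0/0/0/0")
```

Here `leaf["DwUK4"]` is 10 and every other count is 0, matching the reading of the notation given in the text.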

Random Forests

A classification tree provides a simple means of sample classification that is easy to interpret. However, a single tree is not highly robust because of the limitations described above and thus will not predict as well as an ensemble or collection of trees. One of the best ensemble methods is the random forest, which, simply put, builds many trees (a forest) on randomly selected subsamples of the data and variables.16,26 For this study we used 5,000 trees in each forest. Each tree is conceptually similar to the one in Figure 1, and each tree in the ensemble “votes” for the final classification. Random forest also gives a ranking of how important each variable is to the algorithm. Because only a subsample of the data is used in each tree, the data that are not used can be used to evaluate the algorithm without over-fitting. In this work, we use a modified random forest called the Balanced Random Forest (BRF), which uses a stratified random sample instead of a simple random sample on each tree to ensure that all sources are equally represented.26 Predictions are made using the out-of-bag prediction feature of random forest, which provides a means to make predictions on the data used to fit the forest without over-fitting. We utilized the randomForest package in the statistical software R to build the random forest models.27,28
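The stratified sampling and out-of-bag bookkeeping behind a balanced random forest can be illustrated with a short sketch. This is not the randomForest implementation used in the study: tree fitting is omitted, the function names are hypothetical, and drawing as many samples per source as the smallest source contains is an assumption about how "equally represented" might be achieved.

```python
import random
from collections import Counter, defaultdict

def balanced_bootstrap(labels, rng):
    """Draw an equal number of samples, with replacement, from every class.

    Returns (in_bag, out_of_bag) lists of sample indices. The per-class
    draw size is the size of the smallest class (an assumption).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    n_per_class = min(len(indices) for indices in by_class.values())
    in_bag = []
    for indices in by_class.values():
        in_bag.extend(rng.choices(indices, k=n_per_class))
    out_of_bag = sorted(set(range(len(labels))) - set(in_bag))
    return in_bag, out_of_bag

def oob_majority_vote(votes):
    """votes: {sample_index: [labels predicted by trees for which the
    sample was out-of-bag]}. Returns the majority label per sample."""
    return {i: Counter(v).most_common(1)[0][0] for i, v in votes.items() if v}

# Toy labels: one source with 7 usable replicates, two with 10.
labels = ["PsN"] * 7 + ["SgN"] * 10 + ["Bk"] * 10
in_bag, out_of_bag = balanced_bootstrap(labels, random.Random(0))
in_bag_counts = Counter(labels[i] for i in in_bag)
prediction = oob_majority_vote({0: ["PsN", "PsN", "SgN"]})
```

Every class contributes the same number of in-bag draws, and a sample is only ever predicted by trees that did not see it, which is what allows the out-of-bag predictions to serve as an honest accuracy estimate.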

Pairwise Analysis and the Simple Importance Measure

Two approaches, as described below, were developed to determine the variables that are most important in explaining the separation of two given sources, as opposed to automatic separation of all sources simultaneously.

• Pairwise Analysis: Using the In/Out variables, lists of variables that were present in all samples of one source and absent in all samples of another source were compiled for each pair of sources.

• Simple Importance: This novel metric was developed for the Oval Area variables. The idea is to first determine the mean peak height for each Oval Area variable for each source. In general, the further apart the means are, the more influential that variable is. The extent of difference must be tempered, however, with the noise or variability present in the data. Thus, Simple Importance is the ratio of the squared difference between the means for the two sources to the noise present, as expressed in equation (1). The difference is squared because the denominator is the variance rather than the standard deviation.

$$\theta_j = \frac{\left(\mu_j^{(1)} - \mu_j^{(0)}\right)^2}{\sigma_j^2} \quad \text{for } j = 1, \ldots, m \qquad (1)$$

where, for the $j$th of $m$ variables, $\theta_j$ is Simple Importance, $\mu_j^{(1)}$ and $\mu_j^{(0)}$ are the means of the two sources, and $\sigma_j^2$ is the pooled variance.
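Equation (1) can be evaluated for a single variable as in the sketch below. The text does not spell out how the pooled variance is computed, so the usual two-sample pooled variance is assumed here; the function name and the handling of a zero pooled variance are also illustrative choices.

```python
from statistics import mean, variance

def simple_importance(x0, x1):
    """Squared difference of source means divided by the pooled variance.

    x0, x1: Oval Area responses of one compound in the samples of two
    sources (each needs at least two values). The pooled variance is
    assumed to be the standard two-sample form:
        ((n0 - 1) * s0^2 + (n1 - 1) * s1^2) / (n0 + n1 - 2)
    """
    n0, n1 = len(x0), len(x1)
    pooled = ((n0 - 1) * variance(x0) + (n1 - 1) * variance(x1)) / (n0 + n1 - 2)
    diff_sq = (mean(x1) - mean(x0)) ** 2
    if pooled == 0:
        # Illustrative convention: no noise at all means perfect separation
        # when the means differ, and no information when they do not.
        return float("inf") if diff_sq > 0 else 0.0
    return diff_sq / pooled

# Made-up responses: means of 2 and 6, each source with variance 1.
theta = simple_importance([1.0, 2.0, 3.0], [5.0, 6.0, 7.0])
```

For these values the pooled variance is 1 and the squared mean difference is 16, so the Simple Importance is 16.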


CAS Verification and Identification

The results from pairwise analysis were used to guide comparisons of the raw analytical data for the identification of potential CASs. Differences between selected samples were verified by overlaying extracted ion chromatograms of two or more samples and examining the peak responses and spectra for the subject component (Figure S-8). This procedure was used to verify In/Out as well as Simple Importance results. Tentative CAS component identifications were based solely on sample mass spectrometric data. That is, no retention indices or confirmation analysis using an authentic standard was performed. Components that could be visually matched to a mass spectrum in the NIST 05 library were tentatively identified with the component names. Components that could not be sufficiently matched to a library spectrum were given unique labels according to the molecular formulas assigned by the software.

RESULTS AND DISCUSSION

Samples of dichlorvos and dicrotophos were successfully classified with accuracies of 100% by applying the BRF technique to the GC×GC-TOFMS data. Samples of chlorpyrifos were successfully classified with accuracies of 87% or 97% when using the In/Out variables or the Oval Area variables, respectively. In addition, data evaluation for each OPP yielded lists of several tentatively identified CASs and their associations with specific sources. As an illustrative example, results for dichlorvos are presented with a detailed discussion. Brief summaries of the results for dicrotophos and chlorpyrifos are then presented. Finally, we demonstrate the predictive power of the approach by correctly classifying four samples for which the analyst was blind to the true sources.

Dichlorvos: Classification

The dichlorvos data consisted of 70 sample aliquots from six sources and a blank. Use of the BRF method on either the In/Out or Oval Area variables resulted in correct classification rates of 100%, as illustrated in the confusion matrices (see Tables 2 and S-9). Confusion matrices are a common method to illustrate the results of sample classification, with row labels used to indicate the true sample identity and column labels used to indicate the predicted classification. If a sample is classified correctly, the result for that sample will be tallied in the cell for row x and column x (i.e., on the matrix diagonal). If a sample is classified incorrectly, the result for that sample will be tallied in another cell (i.e., off the matrix diagonal). The classification rate is calculated simply as the sum of all tallies on the matrix diagonal divided by the sum of all tallies in the matrix (e.g., 70/70 = 100% for Table 2).

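Table 2. Confusion matrix for BRF using In/Out variables on dichlorvos data. Row labels indicate true identities, and column labels indicate predicted classifications. Green shading in the original (the diagonal cells) indicates correct classifications.

| | MX4 | MXN | Bk | PsN | R5 | RN | SgN |
|---|---|---|---|---|---|---|---|
| MX4 | 10 | 0 | 0 | 0 | 0 | 0 | 0 |
| MXN | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| Bk | 0 | 0 | 10 | 0 | 0 | 0 | 0 |
| PsN | 0 | 0 | 0 | 10 | 0 | 0 | 0 |
| R5 | 0 | 0 | 0 | 0 | 10 | 0 | 0 |
| RN | 0 | 0 | 0 | 0 | 0 | 10 | 0 |
| SgN | 0 | 0 | 0 | 0 | 0 | 0 | 10 |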
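The classification rate defined in the text (diagonal tallies divided by all tallies) can be checked against Table 2 with a few lines. The second matrix, with one sample moved off the diagonal, is a hypothetical example and does not come from the study.

```python
def classification_rate(matrix):
    """Sum of the diagonal tallies divided by the sum of all tallies."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Table 2: 7 classes, 10 samples each, all on the diagonal.
table2 = [[10 if i == j else 0 for j in range(7)] for i in range(7)]
rate = classification_rate(table2)

# Hypothetical variant: one MX4 sample predicted as MXN.
one_error = [row[:] for row in table2]
one_error[0][0] -= 1
one_error[0][1] += 1
rate_with_error = classification_rate(one_error)
```

For Table 2 the rate is 70/70; in the hypothetical variant it drops to 69/70.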

Dichlorvos: CAS Identification

Pairwise analysis was performed to generate lists of compounds that perfectly separated two sources according to the In/Out variables, meaning that a given compound was present in all samples of one source and absent in all samples of the other. The tallies of such compounds for all pairs of sources are indicated in Table 3. Each pair has at least 8 compounds that provide perfect separation. The Simple Importance measure is also useful in identifying compounds that can separate each pair of sources, where a large Simple Importance measure indicates that the difference in mean values is large relative to the variability in the data. As shown in Table S-10, the top 1% of all Simple Importance values for dichlorvos are distributed widely across all pairs of sources, suggesting that multiple compounds may be effective for distinguishing between two arbitrarily selected sources.


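Table 3. The numbers of compounds that perfectly separate pairs of sources for dichlorvos according to In/Out variables.

| | MX4 | MXN | Bk | PsN | R5 | RN | SgN |
|---|---|---|---|---|---|---|---|
| MX4 | -- | 98 | 115 | 104 | 76 | 88 | 114 |
| MXN | | -- | 31 | 10 | 97 | 8 | 18 |
| Bk | | | -- | 25 | 110 | 40 | 62 |
| PsN | | | | -- | 85 | 10 | 17 |
| R5 | | | | | -- | 79 | 110 |
| RN | | | | | | -- | 25 |
| SgN | | | | | | | -- |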
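Tallies like those in Table 3 come from finding, for each pair of sources, the compounds detected in every sample of one source and in no sample of the other. A minimal sketch of that pairwise analysis is given below; the function name, the input layout (one set of detected CASRNs per sample), and the toy detections are assumptions for illustration.

```python
from itertools import combinations

def perfect_separators(detections):
    """Find compounds that perfectly separate each pair of sources.

    detections: {source: [set of CASRNs detected in each sample]}
    A compound separates sources a and b if it is present in all
    samples of one and absent from all samples of the other.
    Returns {(a, b): set of separating CASRNs}.
    """
    always = {s: set.intersection(*samples) for s, samples in detections.items()}
    ever = {s: set.union(*samples) for s, samples in detections.items()}
    result = {}
    for a, b in combinations(sorted(detections), 2):
        a_only = always[a] - ever[b]  # in every a sample, in no b sample
        b_only = always[b] - ever[a]  # in every b sample, in no a sample
        result[(a, b)] = a_only | b_only
    return result

# Made-up detections for two samples of each of two sources.
detections = {
    "MXN": [{"630-20-6", "515-84-4"}, {"630-20-6", "515-84-4", "79-00-5"}],
    "RN": [{"127-18-4", "515-84-4"}, {"127-18-4", "515-84-4"}],
}
separators = perfect_separators(detections)
```

In this toy input, 515-84-4 is shared by both sources and 79-00-5 appears in only one of the MXN samples, so neither counts; the length of each returned set corresponds to one cell of a tally table such as Table 3.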

Compounds that may serve as potential CASs for classifying a sample to a particular source are very likely to be among those compounds included in Tables 4 and 5. Verification of all these compounds was not feasible due to the large number. Therefore, only selected compounds were verified by an analyst. The selected compounds included all compounds identified using the In/Out variables (Table 3) for which detections are reported in a neat source. The selected compounds also included several identified by Simple Importance (Table S-10), where the complete list of tallied compounds was ranked according to frequency of occurrence and compounds were examined by proceeding through the list until four CASs were tentatively identified on the basis of mass spectral library matching.

Analyst examination revealed that several of the potential CASs represented true differences between sources. Those results are considered as “tentatively identified” CASs. In some cases, however, the assigned differences were found to be incorrect due to ineffective compound identification by the mass spectral library in a subset of the samples, possibly due to interferences in the sample mass spectra. Other cases where false differences were assigned are associated with the In/Out variables. In these cases, a compound indicated as present in some samples and absent in others was actually present in the latter samples at levels lower than the software detection threshold. Such issues with detection thresholds do not diminish the usefulness of the In/Out variables as long as the meaning of a non-detection is well defined and understood.


The results from the examination of selected potential CASs from the In/Out variables are shown in Tables 5 and S-11, where the third column indicates whether the compound was tentatively identified, the fourth column indicates chemical relevance, and the last four columns indicate whether the compound was detected in each of the four neat material sources. Only tentatively identified signatures are shown in Table 5, and all verified signatures, regardless of identity status, are shown in Table S-11. Designations for chemical relevance were made according to whether the compound has a chemical similarity or another known relationship (e.g., synthetic precursor) to the tested pesticide (i.e., dichlorvos); such evaluations were made only for those compounds that were tentatively identified. Evaluation of the In/Out variables led to the assignment of 11 tentatively identified dichlorvos CASs, listed in the first 11 rows. Five of these tentatively identified CASs (1,1,1,2-tetrachloroethane; tetrachloroethylene; trichloroacetic acid, ethyl ester; dichloroacetic acid, ethyl ester; and pentachloroethane) display distinctive behavior in one of the neat sources, as indicated by the shaded cells.

There were numerous peaks that could not be tentatively identified according to the mass spectra, as listed in the rows toward the bottom of Table S-11. These peaks, however, may still be useful in classifying the sources according to the data processing and statistical evaluation techniques used. In this case, 15 non-identified peaks display distinctive behavior in one of the neat sources. Although not performed for this study, the chemical identity associated with any specific peak could be determined using appropriate analytical chemistry experimental techniques.

The MXN, RN, and SgN sources have one, two, and seven potential CASs, respectively, that show distinctive positive results. It is notable that the PsN source, which has the highest purity (99.9%, per Table 1) of any dichlorvos source tested, has no potential CASs showing distinctive positive results but includes seven potential CASs showing distinctive negative results. Thus, the PsN source may be the least affected by impurities on a qualitative basis as well as the most pure on a quantitative basis. Overall, the MXN source may be the most difficult to classify on the basis of the potential CASs alone, as there is only one potential CAS that has a distinctive result for MXN. Note that CASs determined for the formulated OPSs (MX4 and RN5) are not included in Table 5 as they were too numerous to verify.

Table 5. Verification status of tentatively identified dichlorvos CASs based on In/Out variables in neat sources only. Shaded cells indicate components that are either distinctively present or distinctively absent in one source.

| Mass Spectral Library Best Match Compound | CAS Registry No. | Tentatively Identified | Chemically Relevant | Detection Status, MXN | Detection Status, RN | Detection Status, SgN | Detection Status, PsN |
|---|---|---|---|---|---|---|---|
| Ethane, 1,1,1,2-tetrachloro- | 630-20-6 | y | y | y | n | n | n |
| Tetrachloroethylene | 127-18-4 | y | y | n | y | n | n |
| 2-Butanol, 1-chloro- | 1873-25-2 | y | y | y | t | y | y |
| Acetic acid, trichloro-, ethyl ester | 515-84-4 | y | y | y | y | n | y |
| Acetic acid, dichloro-, ethyl ester | 535-15-9 | y | y | y | y | y | n |
| Acetic acid, trichloro-, methyl ester | 598-99-2 | y | y | y | y | n | n |
| Ethane, 1,1,2-trichloro- | 79-00-5 | y | y | y | n | y | n |
| Benzene, 1,2,3-trimethyl- | 526-73-8 | y | u | t | y | t | t |
| Phenol, 2,4,6-trichloro- | 88-06-2 | y | u | t | n | y | n |
| Ethane, pentachloro- | 76-01-7 | y | y | n | y | n | n |
| Acetic acid, chloro- | 79-11-8 | y | y | n | t | y | y |

y - yes; n - no; t - peak observed below software detection threshold; n/a - not available; u - unknown
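The notion of a result that is "distinctively present" or "distinctively absent" in one source can be sketched in Python using the detection statuses transcribed from Table 5. The rule below is an assumption made for illustration, not the authors' stated procedure: a row is flagged when exactly one source differs from the other three, and rows containing a below-threshold peak ("t") are treated as ambiguous and skipped. The names `TABLE_5` and `distinctive` are hypothetical.

```python
SOURCES = ("MXN", "RN", "SgN", "PsN")

# Detection status per source (MXN, RN, SgN, PsN), transcribed from Table 5:
# y = detected, n = not detected, t = peak below the software threshold.
TABLE_5 = {
    "Ethane, 1,1,1,2-tetrachloro-": "ynnn",
    "Tetrachloroethylene": "nynn",
    "2-Butanol, 1-chloro-": "ytyy",
    "Acetic acid, trichloro-, ethyl ester": "yyny",
    "Acetic acid, dichloro-, ethyl ester": "yyyn",
    "Acetic acid, trichloro-, methyl ester": "yynn",
    "Ethane, 1,1,2-trichloro-": "ynyn",
    "Benzene, 1,2,3-trimethyl-": "tytt",
    "Phenol, 2,4,6-trichloro-": "tnyn",
    "Ethane, pentachloro-": "nynn",
    "Acetic acid, chloro-": "ntyy",
}


def distinctive(status):
    """Return (source, kind) if exactly one source differs from the other three.

    Assumption: rows with a below-threshold peak ("t") are ambiguous and skipped.
    """
    if "t" in status:
        return None
    if status.count("y") == 1:
        return SOURCES[status.index("y")], "present"
    if status.count("n") == 1:
        return SOURCES[status.index("n")], "absent"
    return None


results = {name: distinctive(status) for name, status in TABLE_5.items()}
flagged = {name: r for name, r in results.items() if r is not None}

for name, (source, kind) in flagged.items():
    print(f"{name}: distinctively {kind} in {source}")
```

Under this assumed rule, the sketch flags the same five compounds that the text names as displaying distinctive behavior in one neat source.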

Analyst examination of selected potential CASs from the Simple Importance measures yielded four tentatively identified CASs: trichloroacetic acid, methyl ester; dichloroacetic acid, ethyl ester; phosphoric acid, 2-chloroethenyl dimethyl ester; and dimethyl methylphosphonate. Only the first two of these compounds had been identified from the In/Out variables, per Table 5. The complementary results obtained from the In/Out variables and the Simple Importance measures indicate the importance of using a comprehensive approach to data evaluation to take advantage of the wealth of information available from GC×GC-TOFMS analysis.

Dicrotophos: Classification and CAS Identification

The dicrotophos data consisted of 50 sample aliquots from four sources and a blank. Use of the BRF method on either the In/Out or Oval Area variables resulted in correct classification rates of 100% (Tables S-12 and S-13). Tallies were made of the compounds that provide perfect separation according to the In/Out variables and that have relatively large Simple Importance measures according to the Oval Area variables (Tables S-14 and S-15). The In/Out variables were evaluated to assign 18 tentatively identified CASs and also to record results for the non-identified CASs (Table S-16). Evaluation of potential CASs from the Simple Importance measures yielded four tentatively identified dicrotophos CASs: trimethyl phosphonoacetate; phosphoric acid, 2-bromo-1-methylethenyl dimethyl ester; mevinphos; and phosphoric acid, dimethyl 1-methylethenyl ester. Only the first two of these compounds had been identified as CASs from the In/Out variables. The trends of these results are very similar to those for dichlorvos.
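One way to read "perfect separation" for a binary In/Out variable is sketched below in Python. The criterion used here (every aliquot from a given source has the same In/Out value, and at least two sources differ from each other) is an assumption for illustration, not the study's stated definition. The source labels, the In/Out values, and the function name `perfectly_separates` are made up.

```python
from collections import defaultdict


def perfectly_separates(labels, values):
    """Check whether a binary variable is constant within every source
    and differs between at least two sources (assumed criterion)."""
    by_source = defaultdict(set)
    for label, value in zip(labels, values):
        by_source[label].add(value)
    constant_within = all(len(seen) == 1 for seen in by_source.values())
    differs_between = len({next(iter(seen)) for seen in by_source.values()}) > 1
    return constant_within and differs_between


# Hypothetical source labels for six aliquots (two per source).
labels = ["S1", "S1", "S2", "S2", "S3", "S3"]

# Hypothetical In/Out values for two compounds.
compound_a = [1, 1, 0, 0, 0, 0]  # detected only in S1, consistently
compound_b = [1, 0, 0, 0, 1, 1]  # inconsistent within S1

print(perfectly_separates(labels, compound_a))
print(perfectly_separates(labels, compound_b))
```

A tally like the one described in the text would then count, per source, the variables for which such a check succeeds.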

Chlorpyrifos: Classification and CAS Identification

The chlorpyrifos data consisted of 67 sample aliquots from six sources and a blank. Use of the BRF method on the In/Out or Oval Area variables resulted in correct classification rates of 87% or 97%, respectively (Tables 6 and S-17). It is known that in cases with small sample sizes, automatic classifiers tend to misclassify samples from the class with fewer observations.29 In this study, the underrepresented source is PsN, which includes only seven samples due to instrument errors in data acquisition. Seven of the nine misclassifications in Table 6 and one of the two misclassifications in Table S-17 are made for PsN samples. Thus, the fact that the chlorpyrifos classification rates are