
Feb 14, 2018 - Center for Environmental Measurement and Analysis, National Institute for Environmental Studies, 16-2 Onogawa, Tsukuba, Ibaraki 305-850...

Article

Direct classification of GC × GC-analyzed complex mixtures using non-negative matrix factorization based feature extraction
Yasuyuki Zushi and Shunji Hashimoto
Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.7b04313 • Publication Date (Web): 14 Feb 2018
Downloaded from http://pubs.acs.org on February 16, 2018


Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Analytical Chemistry

Direct classification of GC × GC-analyzed complex mixtures using non-negative matrix factorization based feature extraction

Yasuyuki Zushi 1,2* and Shunji Hashimoto 2

1 Research Institute of Science for Safety and Sustainability, National Institute of Advanced Industrial Science and Technology, Japan
2 Center for Environmental Measurement and Analysis, National Institute for Environmental Studies, Japan

ABSTRACT: Complex chemical mixtures need to be evaluated to, for example, aid in medical diagnoses, assess product quality, and assess environmental conditions. Two-dimensional gas chromatography (GC × GC), which is a comprehensive analytical technique, combined with data classification techniques has attracted great interest for assessing mixtures. In this study, a nontarget cross-sample analysis-based unsupervised direct classification method using non-negative matrix factorization was developed for assessing mixtures analyzed by GC × GC. The method was developed using GC × GC data for more than 30 river water samples as image data. The retention time shift correction data processing step was important to the classification accuracy because the compound signals were found at slightly different times for different samples. The maximum likelihood estimates of the matching ratios for the 30 samples, with retention time shift correction, were 86.8% and 77.0% using two and three ranks, respectively. The method is easy to perform and intuitive, requiring no specific knowledge or labeled data. This direct classification method will therefore be particularly useful for performing initial screens of large numbers of samples and for identifying major differences between samples.

Mixtures of large numbers of chemicals need to be evaluated in a variety of applications, such as when making medical diagnoses using biomarkers, checking product quality, and making environmental assessments. However, it is difficult to assess the risks posed by chemical mixtures.1 It is therefore necessary to develop methods for simplifying the assessment of chemical mixtures. Methods for comprehensively analyzing mixtures are of interest because they make it possible to decrease the complexity involved in assessing chemical mixtures. Biota tissues and environmental media contain many polar and ionic chemicals, so they are often analyzed by liquid chromatography, electrophoresis,2,3 and gas chromatography using polar stationary phases or after derivatizing nonvolatile chemicals.4 Non-polar chemicals are mainly analyzed by gas chromatography. Recent advances in two-dimensional gas chromatography (GC × GC) have made it a powerful technique for comprehensively analyzing polar and semi-polar chemicals.5,6 GC × GC coupled with high resolution time-of-flight mass spectrometry allows nontarget (full-scan mode) analyses to be performed.7,8 Nontarget analyses allow “target screening” of practically unlimited numbers of chemicals to be performed using specific information (such as m/z ratios and retention times (RTs)) and “nontarget screening” of chemicals to be performed without that information (but requiring mass fragmentation patterns for groups of interest or a database containing information for a wide range of chemicals).9,10 Improving our knowledge for target screening and identifying rules that allow chemicals of interest to be recognized during nontarget screening are both important to the evaluation of mixtures. Along with mass spectral patterns, chromatographic patterns can be valuable when performing nontarget analyses. Here, we focus on using chromatographic patterns to assess complex chemical mixtures.

Classification techniques including pattern recognition have previously been used for target screening, nontarget screening, and other nontarget approaches to GC × GC analyses and other types of analysis (including liquid chromatography and two-dimensional capillary electrophoresis).11-13 Analytical outputs can be classified simply by visually comparing GC × GC chromatograms.14 However, there are more statistically sound approaches, including popular classification techniques for GC × GC data using the supervised-learning-based methods linear discriminant analysis and partial least squares discriminant analysis (PLS-DA), performed on lists of detected compounds (peak tables) prepared from GC × GC outputs.15 In PLS-DA, samples of interest are placed in two classes (binary classification) or more (multiclass classification) based on peak tables containing chemicals detected in the samples. The classifier used to determine the class a sample belongs to is trained using known class-labeled samples. This approach therefore requires a peak assignment process and class-labeled samples (to act as training data). The support vector machine is another machine learning algorithm that has been used.16 A random forest, which is also a machine learning algorithm, has been used to identify sources of organophosphate pesticides. The method identified pesticide sources very accurately (87%–100%) using class-labeled training data from a GC × GC peak table.17 Principal component analysis (PCA) is a popular unsupervised-

ACS Paragon Plus Environment


learning-based classification technique.18-21 PCA consolidates large numbers of variables (such as chemical species in a peak table) into several major variables (i.e., decreases the data dimensions), therefore consolidating the variables to represent categorical classes. Hierarchical and non-hierarchical clustering are also unsupervised classification techniques that do not require known class-labeled samples. Unsupervised methods such as PCA and clustering methods require less information than supervised methods such as PLS-DA, support vector machine, and random forest, but sometimes classify samples less accurately. However, supervised methods are not suitable when the analyst has insufficient information about the samples (e.g., for anomalous samples when investigating pollution caused by a disaster), so it is important that unsupervised classification methods are developed. Both supervised and unsupervised methods generally require the original analytical output to be arranged as peak tables. In a review article of classification using multi-dimensional chromatography data focused on characteristic-matching techniques (the word “feature”, meaning a constitutional processing unit such as a chromatographic pixel, peak, or region, in the original paper is avoided here to avoid confusion), classification methods requiring partial analytical information (such as a peak table) were called “target cross-sample analyses”.22 It is more challenging to perform “nontarget cross-sample analyses”, which take into account all the analyte information in the chromatograms for a sample set and therefore require high-performance data-processing methods that were not available a decade ago. The processing unit types can be divided into data points (pixels), peaks, regions, and peak regions. Nontarget cross-sample analyses are focused on chromatographic patterns, with additional mass spectrum information used as the chromatographic peak characteristics. Nontarget cross-sample analyses usually use small processing units, so data normalization techniques such as RT shift correction are key to the classification process. A functional software module called a “smart template” has been used as a supervised classification method for GC × GC and two-dimensional liquid chromatography data with peak region processing units, i.e., “peak blobs” that form templates. This method has been used to classify GC × GC chromatograms for breast cancer tumor tissues with an accuracy of 78%, calculated using the leave-one-out cross-validation (LOOCV) method using 18 samples.23 Nontarget cross-sample classification methods using pixel processing units and the analysis of variance method have been developed for GC × GC data acquired in several different types of study.24-26 The Fisher ratio, the class-to-class variance divided by the within-class variance, has been calculated for all the pixels in chromatograms of jet fuel samples, to extract distinct chromatographic peaks. The distinct peaks were identified by PCA to allow the jet fuels to be classified into different types.24 The Fisher ratio approach has also been used to identify chemical markers of perinatal asphyxia25 and moisture damage in cacao beans.26 A similar approach, a nontarget cross-sample supervised method using PLS-DA with pixel units, has been used to classify cigarette smoke analyzed by GC × GC.27 The PLS-DA synthesized chromatograms were formed from distinct pixels for the different latent variables, i.e., consolidated variables. The latent variables had positive and negative values, so the


output needed to be explained carefully by an expert. Similarly, negative values in an unsupervised PCA method using the pixel-based approach caused difficulties.20,25 Pixel-based unsupervised classification methods therefore still need to be improved. As mentioned above, mixture classification is sometimes required when little information is available. In the study presented here we first developed an unsupervised classification method using non-negative matrix factorization (NMF) for classifying mixtures analyzed by GC × GC. Second, we developed a framework for classifying a new sample based on the image similarities between GC × GC chromatogram patterns. Combining the factorization step (the first step) and the classification framework (the second step) allowed “direct classification” to be achieved. LOOCV was used to validate the direct classification method. The method performance for pixel-based data was evaluated, altering the chromatographic resolution and RT shift correction under masking conditions included in the procedure. This nontarget cross-sample analysis-based unsupervised direct-classification method does not require class-labeled information for the samples, and will be useful for initial screening and rapidly evaluating complex mixtures.

Methods

Direct classification workflow

The direct classification method, involving both NMF and the image similarity process, is shown in Figure 1. First, the non-class-labeled GC × GC chromatograms used as image data were corrected and used as the NMF dataset. Pretreatment for the NMF involved the optional steps of RT shift correction, masking the chromatogram, and decreasing the chromatogram resolution, then the signal intensities of the image data were normalized. The pretreated input dataset was then processed using the NMF algorithm to extract the features, maintaining the image data format. The first part defined the image features based on the initially acquired NMF input. Classification of a new sample (the second part) involved determining the similarities between the extracted feature images and the sample image. This allowed newly introduced samples from outside the NMF input to be classified. The assigned features, the final results of the direct classification method, were then determined.

Data collection

The GC × GC data for the chemical mixtures were obtained by analyzing river water samples from the entire Tokyo Bay basin, Japan (Figure S-1). The sample preparation and instrumental analysis methods used are described in detail in the supporting information and elsewhere.8,10 Briefly, 1 L of a water sample was concentrated using a solid phase extraction cartridge, then analyzed using a 6890GC gas chromatograph (Agilent Technologies, Santa Clara, CA, USA) with a KT2004 GC × GC system (Zoex, Houston, TX, USA) coupled to a JEOL JMS-T100GC high resolution mass spectrometer (JEOL, Tokyo, Japan). The total runtime was 50 min, and the modulation period was 4 s. Total ion chromatograms were used as the image data.
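How a total ion chromatogram becomes image data can be illustrated with a minimal Python sketch. It assumes the usual GC × GC convention of folding the one-dimensional detector trace at each modulation period. The runtime (50 min) and modulation period (4 s) come from the text; the acquisition rate, the function name `fold_chromatogram`, and the synthetic trace are hypothetical and are not taken from the authors' R code.

```python
def fold_chromatogram(signal, points_per_modulation):
    """Fold a 1-D detector trace into a 2-D image.

    Rows index the second-dimension RT (time within one modulation);
    columns index the first-dimension RT (modulation number).
    """
    n_mod = len(signal) // points_per_modulation  # an incomplete tail is dropped
    return [
        [signal[x * points_per_modulation + y] for x in range(n_mod)]
        for y in range(points_per_modulation)
    ]


run_time_s = 50 * 60          # total runtime from the text: 50 min
modulation_period_s = 4       # modulation period from the text: 4 s
acquisition_rate_hz = 25      # ASSUMPTION: hypothetical value, not stated in the text

n_modulations = run_time_s // modulation_period_s                  # image columns
points_per_modulation = modulation_period_s * acquisition_rate_hz  # image rows

# Synthetic trace: each value equals its position within the modulation.
signal = [float(i % points_per_modulation)
          for i in range(n_modulations * points_per_modulation)]
image = fold_chromatogram(signal, points_per_modulation)
```

Under these assumptions each sample gives an image with one column per modulation, which is the pixel grid on which the later pretreatment steps operate.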


All 34 data images (the results of the GC × GC analyses of the river water samples) are shown in Figure 2. Data pretreatment for the RT shift correction was performed using GC Image R 2.6 software (GC Image LLC, Lincoln, NE, USA) using a plugin tool.28,29 Briefly, 10 alignment points were selected for each chromatogram from the spiked internal standards and identical peaks (Table S-1). The alignment points for each chromatogram were then matched with the points for the standard chromatogram (sample ID1) in the RT shift correction algorithm. The chromatogram between two alignment points was interpolated by performing a linear interpolation for the first dimension (the x-axis) and using the Sibson natural neighbor method for the second dimension (the y-axis). Peak deformation caused by the process could be corrected if desired. Because this RT shift correction uses interpolation, the aligned chromatogram carries only a low risk of losing or overlooking peaks, which might happen when a chromatogram is analyzed with tile-grid units. A recently developed tile-based technique can also avoid overlooking peaks.30 The edge of the chromatographic domain was masked for the first 3.3 s and last 50 s on the x-axis and the first 0.2 s and last 0.6 s on the y-axis to exclude systematic signals such as stationary phase bleeding from the column and occasional random noise. The process was aimed at preventing insignificant signals from affecting the signals of interest. The chromatographic resolution was decreased by averaging the intensities of the selected pixels to give a new pixel. A resolution decrease ratio of 2 × 2 indicates that 2 pixels on the x-axis and 2 pixels on the y-axis were combined and averaged to give a single pixel. This procedure, like the RT shift correction process, was expected to mitigate adverse effects on the feature extraction and classification processes caused by small shifts in the pixel positions between GC × GC chromatograms.
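The masking and resolution-decrease steps described above can be sketched in Python as follows. This is an illustration, not the authors' in-house R program: the function names are hypothetical, the mask is expressed in pixels rather than seconds, and setting masked pixels to zero (rather than cropping them) is an assumption.

```python
def mask_edges(image, first_x, last_x, first_y, last_y):
    """Zero the first/last columns (x-axis) and first/last rows (y-axis).

    ASSUMPTION: masked pixels are set to 0; widths are given in pixels.
    """
    n_rows, n_cols = len(image), len(image[0])
    return [
        [
            value
            if first_x <= x < n_cols - last_x and first_y <= y < n_rows - last_y
            else 0.0
            for x, value in enumerate(row)
        ]
        for y, row in enumerate(image)
    ]


def decrease_resolution(image, fx, fy):
    """Average non-overlapping blocks of fx pixels (x-axis) by fy pixels (y-axis)."""
    n_rows, n_cols = len(image) // fy, len(image[0]) // fx  # leftover edge pixels dropped
    reduced = []
    for by in range(n_rows):
        row = []
        for bx in range(n_cols):
            block = [image[by * fy + dy][bx * fx + dx]
                     for dy in range(fy) for dx in range(fx)]
            row.append(sum(block) / len(block))
        reduced.append(row)
    return reduced


# Toy 4 x 4 "chromatogram" (rows = y-axis, columns = x-axis).
image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
masked = mask_edges(image, first_x=1, last_x=0, first_y=0, last_y=1)
reduced = decrease_resolution(image, 2, 2)  # a 2 x 2 decrease ratio
```

With a 2 × 2 decrease ratio the toy image shrinks from 4 × 4 to 2 × 2 pixels, so a peak that moves by one pixel between samples often still falls in the same averaged pixel.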
The intensities of the signals in a chromatogram were normalized to give a total intensity of 1 to focus on the “patterns” of the sample chromatograms. The masking, resolution decrease, and normalization processes were performed using an in-house program using R software.31 The R code for the direct classification (available as Supporting Information) runs on a 64-bit PC (CPU, Core i7−4710MQ, quad-core, 2.5 GHz; physical memory, 16 GB RAM at 1600 MHz). If the sample size increases so that the data exceed the PC memory, the calculation cannot be performed.

NMF

The NMF theory and computation method have been described elsewhere.32-34 Briefly, in NMF, the basis matrix W (an n × r non-negative matrix) and the coefficient matrix H (an r × p non-negative matrix) for the original matrix Y (an n × p non-negative matrix) were calculated so that the sum of the loss function D(Y, WH) and the regularization function R(W, H) was minimized, as shown in equations (1) and (2). D was calculated using the Frobenius distance, which is also called the Euclidean distance.

$$Y \approx WH \qquad (1)$$

$$\min_{W,\,H \ge 0} \left[ D(Y, WH) + R(W, H) \right] \qquad (2)$$
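As an illustration of equations (1) and (2), the following Python sketch normalizes each image to a total intensity of 1 and factorizes Y with the Lee and Seung multiplicative updates for the Frobenius loss. Several points are assumptions: the regularization term R is set to zero, the multiplicative-update rule is one common choice and may not be the algorithm selected in the R package NMF, and all function names and the toy data are hypothetical.

```python
import random


def normalize(image_vector):
    """Scale one image (flattened to a vector) to a total intensity of 1."""
    total = sum(image_vector)
    return [v / total for v in image_vector]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def frobenius_loss(y, w, h):
    """D(Y, WH): sum of squared differences between Y and WH."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for ry, rw in zip(y, wh) for a, b in zip(ry, rw))


def nmf(y, rank, n_iter=500, seed=0, eps=1e-12):
    """Factorize Y (n x p) into W (n x rank) and H (rank x p), all non-negative."""
    rng = random.Random(seed)
    n, p = len(y), len(y[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(p)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T Y) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, y), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(p)]
             for i in range(rank)]
        # W <- W * (Y H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(y, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Four toy "images" of four pixels each, built from two underlying patterns.
raw_images = [[2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 3.0],
              [1.0, 1.0, 1.0, 1.0],
              [4.0, 4.0, 1.0, 1.0]]
y = transpose([normalize(img) for img in raw_images])  # rows = pixels, columns = samples
w, h = nmf(y, rank=2)
loss = frobenius_loss(y, w, h)
```

In this layout each column of W is an extracted feature with the same pixel grid as the input images, and each column of H holds the coefficients of one sample.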

The matrix elements of W and H were iteratively calculated and updated under non-negative constraints so that the metric D decreased. The R-package NMF34 was used to perform the calculations. The chromatographic image data with the pretreatments shown in Figure 1 were used as the original matrix Y, then the matrix was decomposed. The extracted features of the images were produced as the basis matrix W, and the weightings to reconstruct the original images were produced as the coefficient matrix H. Factorization rank numbers were re-determined from the magnitudes of the sums of the coefficients for the different ranks, i.e., the basis matrix with the highest total coefficient in a sample set was rank 1. The re-determined rank number with the highest coefficient for a sample was taken as the classification rank of the sample. Other metrics for the classification ranks were required for new samples, so a metric for image similarity was used and is described in the next subsection.

Metric for classification

An image similarity metric was used to classify a sample of interest into the rank constructed using NMF (Figure 1), i.e., the metric was used to estimate the classification rank (extracted feature) for a new sample not included in the NMF input dataset. The image of the new sample and the features extracted by NMF were compared using the metric in the classification step. The image similarity was calculated in the same way as the cosine similarity, which is a normalized dot product of two vectors of an image, as shown in equation (3).

$$\text{Image similarity} = \frac{Z_x^{\top} Z_y}{\sqrt{Z_x^{\top} Z_x}\,\sqrt{Z_y^{\top} Z_y}} \qquad (3)$$

where Zx is the image vector of the extracted features or training data and Zy is the image vector of the test data. Z is the matrix of Zx and Zy.

Results and discussion

Determination of the number of ranks using the NMF

Estimating the optimal number of ranks is important to the NMF calculation used in the first part of the method. Several guidelines have been developed for determining the number of ranks, including hierarchical clustering and using a scree plot of the residual sum of squares for the input data and NMF estimates. The optimal number of ranks for the studied dataset was determined using both the clustering and scree plot methods (Figure S-2). The scree plot showed that the residual sum of squares decreased little when the number of ranks was >3. The clustering method gave two distinct sample size clusters, based on the Euclidean distances, but a third cluster was also informative. These results indicated that three ranks could be used for the dataset but that two ranks should be simultaneously tested. The NMF results


found using three ranks are shown in Figure 3. Refer to Figure S-1 for the sample IDs in Figure 3. The extracted features and coefficients for each sample, which determined the NMF ranks, are shown in the figure. The extracted features for each rank were directly comparable with the NMF input images because the non-negative constraints of the matrix decomposition algorithm (which are not used in the traditional PCA algorithm) gave realistic NMF output. Figure S-3 shows the results obtained by traditional PCA. Owing to the negative values in the PC loadings and the PC scores, direct comparison between the input and PCA output was impractical. Because the features extracted by NMF can be compared directly with the GC × GC image data, the GC × GC-MS data can be clipped using the extracted features of rank 1, rank 2, and rank 3. Briefly, this analysis combined with a NIST MS search of the clipped and emphasized chromatographic peaks showed that the samples assigned as rank 1 mainly contained alkanes, terpenes, aromatics, and high levels of unresolved complex mixtures. Additionally, the samples assigned as rank 2 contained pesticides and organophosphorus flame retardants and some of the rank 1 compounds: alkanes, terpenoids, and aromatics. The sample assigned as rank 3 contained fewer species, including terpenoids, aromatics, and polymer stabilizers (see Figure S-4 for details). For effective use of the MS information, tensor decomposition for direct classification of the whole GC × GC-MS data would be one approach. Although the NMF extracted several visible artifacts, such as peak tailing (e.g., a concavity in the extracted feature of rank 1 in Figure 3), most of the obtained features were compound peaks. The optimal number of ranks was also surveyed using a different approach, examining the correlations between the NMF results and other available information on the sample origins (i.e., the land uses in the watersheds of the samples). The correlation between the sample coefficients calculated by NMF, using a rank number of three, and the land uses was calculated by performing canonical correlation analysis (CCA), and the results are shown in Figure S-5. The rank number tests indicated that no significant correlations were found using either three or two ranks. Therefore, CCA was not useful for determining the rank number in this case. For further discussion, see the section “Exploration of optimal number of ranks by CCA” in the Supporting Information.

Validity of the image similarity

The image similarity metric used in the second part of the method was evaluated to see if it could correctly match the rank determined from the NMF coefficient. Using two ranks, all the sample images (using a resolution decrease ratio of 2 × 2) were correctly assigned by the image similarity method using the NMF coefficients for the NMF dataset without RT shift correction (Table S-2). The mismatch percentages were 2.9% and 8.8% when resolutions of 1 × 1 and 3 × 3, respectively, were used. Decreasing the resolution was a reasonable approach to decreasing the calculation time (a decrease of approximately one-fifth when the resolution was changed from 1 × 1 to 3 × 3), and mitigated small shifts in the pixel positions that could affect the feature extraction accuracy when RT shift correction was not performed. Using three ranks, the mismatch percentage


was 14.7% for resolutions of both 1 × 1 and 2 × 2. Using four ranks increased the mismatch further, which we concluded was caused by extracting subtle differences in the original images. Determining the optimal rank is therefore important to the direct classification method. Performing RT shift correction on an NMF dataset as a pretreatment gave the lowest mismatch (11.8% for three ranks) using a resolution of 1 × 1. However, the mismatch percentage was higher (at 20.6%) for resolutions of 2 × 2 and 3 × 3 using three ranks. Using RT shift correction therefore allows the resolution decreasing process to be skipped for other numbers of ranks (Table S-3). Decreasing the resolution was still expected to reduce the calculation cost of comparing results obtained under varying conditions, so a resolution of 2 × 2 was used in the subsequent comparative evaluation.

Image pretreatment performance

The direct classification of a new sample using the second part of the method was replicated using the LOOCV method, and the performance was evaluated using several pretreatment conditions. The LOOCV procedure is described next. First, a test image sample was chosen from all the image data, and NMF was performed on the rest of the sample dataset. Next, the classification rank of the test image sample was determined from the image similarity between the extracted features of the ranks determined by NMF and the test image. The process was repeated until classification ranks for all the samples were determined using the LOOCV method (i.e., the LOOCV test was performed 34 times). The classification ranks determined by LOOCV and by NMF were compared to determine whether the direct classification method classified the samples accurately. The comparison was performed using different conditions, including different image resolutions and rank numbers and with and without RT shift correction.
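The comparison described above can be sketched in Python as follows. The sketch covers only the two rank assignments and the matching ratio: it does not refit NMF for each held-out sample, as the LOOCV procedure does, and it omits the re-ordering of ranks by total coefficient. The function names, toy features, images, and coefficients are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Normalized dot product of two image vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def rank_from_coefficients(coefficients):
    """Rank (1-based) with the highest NMF coefficient for one sample."""
    return max(range(len(coefficients)), key=lambda r: coefficients[r]) + 1


def rank_from_similarity(image, features):
    """Rank (1-based) of the extracted feature most similar to the image."""
    sims = [cosine_similarity(image, f) for f in features]
    return max(range(len(sims)), key=lambda r: sims[r]) + 1


def matching_ratio(ranks_a, ranks_b):
    """Percentage of samples given the same rank by both assignments."""
    matches = sum(a == b for a, b in zip(ranks_a, ranks_b))
    return 100.0 * matches / len(ranks_a)


# Hypothetical extracted features (one flattened image per rank).
features = [[0.5, 0.5, 0.0, 0.0],
            [0.0, 0.0, 0.5, 0.5]]
# Hypothetical test images and the NMF coefficients of the same samples.
images = [[0.4, 0.4, 0.1, 0.1],
          [0.1, 0.1, 0.4, 0.4],
          [0.3, 0.3, 0.2, 0.2]]
coefficients = [[0.8, 0.2],
                [0.2, 0.8],
                [0.4, 0.6]]  # third sample chosen so the two assignments disagree

ranks_similarity = [rank_from_similarity(img, features) for img in images]
ranks_nmf = [rank_from_coefficients(c) for c in coefficients]
ratio = matching_ratio(ranks_similarity, ranks_nmf)
```

In this toy case two of the three samples receive the same rank from both assignments, so the matching ratio is two-thirds expressed as a percentage.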
The matching ratios, the results of the comparisons using the different conditions, are shown in Table 1. The effect of decreasing the resolution was not clear, but RT shift correction increased the matching ratio in most cases, especially using two ranks. Features emphasizing subtle differences between samples, such as in the chromatogram background, were extracted when >4 ranks were used (Figure S-6), meaning that the image similarity metric results were disjointed and the evaluation was meaningless. In summary, RT shift correction improved the performance of the direct classification method using the optimal number of ranks. Decreasing the resolution did not improve the matching ratio when RT shift correction was performed, but had the advantage of reducing the calculation cost. All the RT-shift-corrected data and classification ranks determined from the image similarities are shown in Figure S-7.

Matching ratio and the sample size

The sample size used for NMF was important to achieving successful feature extraction and classification in both the first and second parts of the method. The matching ratios for the LOOCV tests and NMF results were therefore calculated, increasing the number of samples used for the NMF input with a resolution of 2 × 2 (Figure 4). The dataset was increased by adding a new sample (randomly selected from the full range of samples). The sequential calculation was repeated 50 times, using a different initial sample set


each time. A logistic regression model was fitted to the data to focus on the trends and to estimate the optimal number of samples to use for the NMF-based feature extraction. Using either two or three ranks, RT shift correction improved the matching ratio as the sample size increased and decreased the matching ratio deviation. The maximum likelihood estimates from logistic regression for 30 samples with RT shift correction were 86.8% and 77.0% for two and three ranks, respectively. The regressions predicted that 47 and 58 samples would give a 90% match for two and three ranks, respectively, and 124 samples would give a 99% match for three ranks. A dataset with a large number of samples would therefore offer great advantages in terms of the direct classification accuracy.

Conclusions and perspective

A nontarget cross-sample analysis-based unsupervised direct-classification method for evaluating samples of mixtures analyzed by GC × GC was developed. The NMF used in the first part of the method was used to extract features from a mixture dataset with difficult-to-label characteristics. It was therefore important to estimate the optimal number of ranks to avoid information being lost and to avoid the NMF-based feature extraction process resulting in overfitting. The RT shifts found in different instrumental runs meant that RT shift correction was important for accurate classification to be achieved. Direct classification offers the advantage that, once the optimal number of ranks for NMF has been determined, newly introduced samples can be rapidly classified by directly comparing them with the extracted features through the image similarity evaluation process used in the second part of the method. No expert knowledge is required to perform this nontarget-based classification method. The results can be intuitively checked by performing a visual inspection of the images produced.
The constituents of mixtures, especially mixtures in environmental media, can be very complex, so it is often important to have available a classification method that does not require labeled samples, has low computational costs, and is not labor intensive. The method presented here performs less well when extracting subtle differences and target peaks, and gave a poorer matching accuracy than can be achieved using supervised classification methods.17,35 However, the cost saving approach of directly comparing extracted features by performing an image similarity evaluation would be preferable in practice. The advantages offered by the low calculation costs and improved classification accuracy will depend on the actual situation. The method is easy to perform and intuitive, and does not require specific knowledge or class-labeled training data, so the direct classification method will be advantageous especially for the initial screening of large numbers of samples and for identifying major differences between samples. The method for GC × GC analysis data is expected to be useful for a wide range of applications, such as checking the qualities of manufactured products, identifying anomalies after disasters, identifying sources of pollutants and other materials, diagnosing health problems from biomarkers, and identifying potentially hazardous mixtures. In summary, the method will be useful for rapidly screening complex mixtures analyzed by GC × GC and is expected to have applications for data acquired using other analytical methods.

Figure 1. Procedure flowchart for the direct classification method. The direct classification method involves feature extraction based on non-negative matrix factorization (NMF), the first part, and an image similarity process, the second part. The bold arrows indicate the recommended procedural flow.

Figure 2. Original chromatograms for the samples used in the direct classification method. The sample name (ID 1 to ID 34) is shown at the top left of each chromatogram. The axes of each chromatogram show the RT of the first GC (8 to 50 min) on the x-axis and the RT of the second GC (0.6 to 3.8 s) on the y-axis. The color axis (blue to red) represents the normalized intensity values ranging from 0 to 1.


Figure 3. Non-negative matrix factorization (NMF) matrix (chromatograms of the extracted features) and coefficients. The extracted features for NMF ranks 1–3 are shown on the left, and the coefficients (contributions) for the different samples are shown on the right. The axes of the extracted features show the RT of the first GC (8 to 50 min) on the x-axis and the RT of the second GC (0.6 to 3.8 s) on the y-axis. The color axis (blue to red) represents the normalized intensity values ranging from 0 to 1.
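To make the image similarity step more concrete, the following Python sketch compares a flattened chromatogram with each extracted feature and picks the closest one. Cosine similarity is used here only as a stand-in, because the metric used in the study is not specified in this section. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero image is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def most_similar_feature(chromatogram, features):
    """Return (index of the best-matching feature, list of all scores)."""
    scores = [cosine_similarity(chromatogram, f) for f in features]
    best = max(range(len(features)), key=lambda k: scores[k])
    return best, scores


# Hypothetical flattened feature images (for example, columns of an NMF basis).
features = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
new_sample = [0.9, 0.8, 0.1, 0.0]

best, scores = most_similar_feature(new_sample, features)
print(best, scores)
```

A scale-invariant measure such as this one compares the shape of the intensity pattern rather than its absolute level, which may or may not be appropriate depending on how the chromatograms are normalized.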

Table 1. Matching ratios (%) for the classification rank determination using the image similarity metric and the non-negative matrix factorization results

Chromatogram resolution decrease ratio:
          1 × 1 (original)                     2 × 2                                3 × 3
          Not corrected   RT shift corrected   Not corrected   RT shift corrected   Not corrected   RT shift corrected
Rank: 2   100.0           97.1                 100.0           73.5                 97.1            82.4
Rank: 3   64.7            79.4                 79.4            82.4                 76.5            55.9

Leave-one-out cross validation (LOOCV) was performed during the classification rank determination process using the image similarity to evaluate the classification performance for a new sample.
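The leave-one-out loop behind a matching ratio can be sketched as follows in Python. This is a simplified illustration, not the authors' procedure: the reference labels are supplied by hand, class-mean templates stand in for NMF-extracted features, cosine similarity stands in for the image similarity metric, and all names and data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def class_templates(samples, labels):
    """Average the training chromatograms of each class into one template
    (an assumed stand-in for features extracted from the training set)."""
    sums, counts = {}, {}
    for sample, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(sample))
        for i, value in enumerate(sample):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def loocv_matching_ratio(samples, labels):
    """Hold out each sample in turn, classify it from the remaining samples,
    and return the percentage of held-out samples assigned their own label."""
    hits = 0
    for i in range(len(samples)):
        train_samples = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        templates = class_templates(train_samples, train_labels)
        predicted = max(templates, key=lambda lab: cosine(samples[i], templates[lab]))
        if predicted == labels[i]:
            hits += 1
    return 100.0 * hits / len(samples)


# Hypothetical flattened chromatograms with two well-separated classes.
samples = [
    [1.0, 0.9, 0.0, 0.1],
    [0.8, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
]
labels = [0, 0, 1, 1]

print(loocv_matching_ratio(samples, labels))  # 100.0
```

Because each held-out sample is excluded before the templates are rebuilt, the ratio reflects how a new, unseen sample would be classified rather than how well the training samples are fitted.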


Figure 4. Leave-one-out cross validations (LOOCVs) for evaluating the performance of the classification method when increasing the number of samples. A random sample was used for each LOOCV test (each dot). The procedure involved adding the random samples one by one to the initial sample set (the test types are indicated with dashed lines). The tests were repeated 50 times with different initial sample sets, and the series are shown as different colored dashed lines. Logistic regression curves with 95% confidence intervals are shown.


ASSOCIATED CONTENT
Supporting Information
Figures S-1–S-7 and Tables S-1–S-3 are available as supporting information. This material is available free of charge via the Internet at http://pubs.acs.org. The R code for the direct classification method is available at https://github.com/Yasuyuki-Zushi/GCxGC-NMFClassification.

AUTHOR INFORMATION
Corresponding Author
* Yasuyuki Zushi, National Institute of Advanced Industrial Science and Technology, 16-1 Onogawa, Tsukuba, Ibaraki 305-8506, Japan. Tel: +81-29-861-2970. E-mail: [email protected]

Notes
The authors declare no competing financial interests.

ACKNOWLEDGMENTS
This study was supported by a Grant-in-Aid for Young Scientists (A) (grant no. 15H05340) and partially by a Grant-in-Aid for Scientific Research (A) (grant no. 17H00796), both part of the JSPS KAKENHI program. We thank Dr. Jonas Gros and Dr. Samuel Arey for valuable advice on sharing our code.

REFERENCES
(1) Kortenkamp, A.; Backhaus, T.; Faust, M. 2009, 391 p.
(2) Brack, W.; Ait-Aissa, S.; Burgess, R. M.; Busch, W.; Creusot, N.; Di Paolo, C.; Escher, B. I.; Mark Hewitt, L.; Hilscherova, K.; Hollender, J.; Hollert, H.; Jonker, W.; Kool, J.; Lamoree, M.; Muschket, M.; Neumann, S.; Rostkowski, P.; Ruttkies, C.; Schollee, J.; Schymanski, E. L., et al. Sci. Total Environ. 2016, 544, 1073-1118.
(3) Görg, A.; Weiss, W.; Dunn, M. J. Proteomics 2004, 4, 3665-3685.
(4) Weber, W.; Andersson, J. T. Anal. Bioanal. Chem. 2014, 406, 5347-5358.
(5) Giddings, J. C. Anal. Chem. 1984, 56, 1258A-1270A.
(6) Mondello, L. In Wiley Series on Mass Spectrometry; John Wiley & Sons, Inc., 2011, p 496.
(7) Ochiai, N.; Ieda, T.; Sasamoto, K.; Fushimi, A.; Hasegawa, S.; Tanabe, K.; Kobayashi, S. J. Chromatogr. A 2007, 1150, 13-20.
(8) Zushi, Y.; Hashimoto, S.; Tanabe, K. Chemosphere 2016, 156, 398-406.
(9) Krauss, M.; Singer, H.; Hollender, J. Anal. Bioanal. Chem. 2010, 397, 943-951.
(10) Zushi, Y.; Hashimoto, S.; Tamada, M.; Masunaga, S.; Kanai, Y.; Tanabe, K. J. Chromatogr. A 2014, 1338, 117-126.
(11) Berrueta, L. A.; Alonso-Salces, R. M.; Héberger, K. J. Chromatogr. A 2007, 1158, 196-214.
(12) Kjærsgård, I. V. H.; Nørrelykke, M. R.; Jessen, F. Proteomics 2006, 6, 1606-1618.
(13) Beltrán, N. H.; Duarte-Mermoud, M. A.; Salah, S. A.; Bustos, M. A.; Peña-Neira, A. I.; Loyola, E. A.; Jalocha, J. W. J. Food Eng. 2005, 67, 483-490.
(14) Cardeal, Z. L.; de Souza, P. P.; Silva, M. D. R. G. d.; Marriott, P. J. Talanta 2008, 74, 793-799.
(15) Pierce, K. M.; Kehimkar, B.; Marney, L. C.; Hoggard, J. C.; Synovec, R. E. J. Chromatogr. A 2012, 1255, 3-11.
(16) Stanimirova, I.; Üstün, B.; Cajka, T.; Riddelova, K.; Hajslova, J.; Buydens, L. M. C.; Walczak, B. Food Chem. 2010, 118, 171-176.
(17) Strozier, E. D.; Mooney, D. D.; Friedenberg, D. A.; Klupinski, T. P.; Triplett, C. A. Anal. Chem. 2016, 88, 7068-7075.
(18) Qiu, Y.; Lu, X.; Pang, T.; Zhu, S.; Kong, H.; Xu, G. J. Pharmaceut. Biomed. 2007, 43, 1721-1727.
(19) Welke, J. E.; Manfroi, V.; Zanus, M.; Lazzarotto, M.; Alcaraz Zini, C. Food Chem. 2013, 141, 3897-3905.
(20) McGregor, L. A.; Gauchotte-Lindsay, C.; Nic Daéid, N.; Thomas, R.; Kalin, R. M. Environ. Sci. Technol. 2012, 46, 3744-3752.
(21) Mitrevski, B.; Veleska, B.; Engel, E.; Wynne, P.; Song, S. M.; Marriott, P. J. Forensic Sci. Int. 2011, 209, 11-20.
(22) Reichenbach, S. E.; Tian, X.; Cordero, C.; Tao, Q. J. Chromatogr. A 2012, 1226, 140-148.
(23) Reichenbach, S. E.; Tian, X.; Tao, Q.; Ledford Jr, E. B.; Wu, Z.; Fiehn, O. Talanta 2011, 83, 1279-1288.
(24) Johnson, K. J.; Synovec, R. E. Chemometr. Intell. Lab. 2002, 60, 225-237.
(25) Beckstrom, A. C.; Humston, E. M.; Snyder, L. R.; Synovec, R. E.; Juul, S. E. J. Chromatogr. A 2011, 1218, 1899-1906.
(26) Humston, E. M.; Knowles, J. D.; McShea, A.; Synovec, R. E. J. Chromatogr. A 2010, 1217, 1963-1970.
(27) Gröger, T.; Welthagen, W.; Mitschke, S.; Schäffer, M.; Zimmermann, R. J. Sep. Sci. 2008, 31, 3366-3374.
(28) Gros, J.; Nabi, D.; Dimitriou-Christidis, P.; Rutler, R.; Arey, J. S. Anal. Chem. 2012, 84, 9033-9040.
(29) Zushi, Y.; Gros, J.; Tao, Q.; Reichenbach, S. E.; Hashimoto, S.; Arey, J. S. J. Chromatogr. A 2017, 1508, 121-129.
(30) Parsons, B. A.; Marney, L. C.; Siegler, W. C.; Hoggard, J. C.; Wright, B. W.; Synovec, R. E. Anal. Chem. 2015, 87, 3812-3819.
(31) R Development Core Team. R Development Core Team, https://cran.r-project.org/.
(32) Lee, D. D.; Seung, H. S. Nature 1999, 401, 788-791.
(33) Zushi, Y.; Hashimoto, S.; Tanabe, K. Anal. Chem. 2015, 87, 1829-1838.
(34) Gaujoux, R.; Seoighe, C. BMC Bioinformatics 2010, 11, 367.
(35) Reichenbach, S. E.; Carr, P. W.; Stoll, D. R.; Tao, Q. J. Chromatogr. A 2009, 1216, 3458-3466.

