
Article

Application of a deep neural network to metabolomics studies and its performance in determining important variables
Yasuhiro Date and Jun Kikuchi
Anal. Chem., Just Accepted Manuscript. DOI: 10.1021/acs.analchem.7b03795. Publication Date (Web): December 26, 2017.


Application of a deep neural network to metabolomics studies and its performance in determining important variables

Yasuhiro Date1,2,* and Jun Kikuchi1,2,3,*

1 RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan.
2 Graduate School of Medical Life Science, Yokohama City University, 1-7-29 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan.
3 Graduate School of Bioagricultural Sciences, Nagoya University, 1 Furo-cho, Chikusa-ku, Nagoya, Aichi 464-8601, Japan.

*Corresponding authors: Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan. Tel: +81455039439; Fax: +81455039489; E-mail: [email protected] (Y.D.); [email protected] (J.K.)

Abstract

Deep neural networks (DNNs), a class of machine learning approaches, are powerful tools for analyzing big data sets derived from biological and environmental systems. However, DNNs have been difficult to apply to metabolomics studies because they cannot readily identify contribution factors, e.g., biomarkers, in the constructed classification and regression models. In this paper, we describe an improved DNN-based analytical approach, named DNN-MDA, that incorporates an importance estimate for each variable using a mean decrease accuracy (MDA) calculation based on a permutation algorithm. The performance of the DNN-MDA approach was evaluated using a data set of metabolic profiles derived from yellowfin goby that lived in various rivers throughout Japan, and it was compared with that of conventional multivariate and machine learning methods; the DNN-MDA approach achieved the best classification accuracy (97.8%) among the examined methods. In addition, the DNN-MDA approach facilitated the identification of important variables such as trimethylamine N-oxide, inosinic acid, and glycine, which were characteristic metabolites that contributed to the discrimination of the geographical differences between fish caught in the Kanto region and those caught in other regions. The DNN-MDA approach is therefore a useful and powerful tool for determining the geographical origins of specimens and identifying their biomarkers in metabolomics studies conducted in biological and environmental systems.


Introduction

Machine learning, as typified by deep learning, impacts diverse aspects of modern societies and is widely used in systems ranging from search engines on the World Wide Web to commercial products such as personal computers and smartphones.1 Machine learning has received much attention from a variety of scientific fields, encompassing not only the information and computer sciences but also chemistry and biology. One field of chemistry and biology in need of machine learning technologies is metabolomics. Metabolomics is a profiling technique that facilitates comprehensive evaluation of a large variety of metabolic information derived from complex biological reactions. Nuclear magnetic resonance (NMR) is a key technology for metabolomics studies; it offers several advantageous features, such as robustness, reproducibility, and interlaboratory compatibility,2,3 and it can also be used to obtain position-dependent information in live cells through isotopomer analyses.4 NMR-based metabolomics has been utilized in a wide variety of research efforts, such as the real-time metabolomic monitoring of live cancer cells,5 a study of fine-needle aspiration specimens of thyroid nodules,6 the metabolic profiling of Botswanan soil,7 and a noninvasive analysis of the metabolic changes caused by nutrient intake in diverse species of fish.8 NMR-based metabolomics studies have been assisted by a variety of analytical tools and useful databases such as SpinAssign,9 SpinCouple,10 BMRB,11 HMDB,12 TOCCATA,13,14 the Birmingham Metabolite Library,15 NMRShiftDB,16 BATMAN,17 MetaboAnalyst,18 MVAPACK,19 a fragment-assembly approach,20 statistical total correlation spectroscopy,21 market basket analysis,22 and signal enhancement by spectral integration.23

The partial least squares (PLS) method is most commonly used in the data mining step of metabolomics studies for supervised data analyses (i.e., classification and regression); however, machine learning approaches such as random forest (RF)24 and support vector machines (SVMs)25 have been used as alternative data mining methods.26 These machine learning approaches have been successfully applied to metabolomics studies, e.g., in the evaluation of classification performance for healthy subjects and patients with Streptococcus pneumoniae infection,27 in the performance evaluation of six binary classification algorithms,28 in the determination of the geographical origin of medicinal herbs,29 in the discovery of metabolic biomarkers for devil facial tumor disease,30 in the metabolic profiling of patients suffering from chronic obstructive pulmonary disease,31 in the screening of metabolic biomarkers in patients with Crohn's disease,32 in defining the metabolic signature of patients with celiac disease,33 and in the evaluation of the performance of feature extraction by the knowledge discovery by accuracy maximization (KODAMA) algorithm.34 However, in our metabolomics studies we have come across a case that could not be analyzed by the conventional PLS method (see the Results and discussion section for more details), and an alternative approach was therefore required. This study describes our attempt to circumvent a classification problem for which an appropriate model could not be constructed by the PLS method.


We focused on deep neural networks (DNNs), a type of deep learning (and a neural network-based approach), to solve our classification problem. However, typical DNN algorithms are unable to identify the contribution factors (i.e., the important variables) of a classification or regression model because of the highly complex computations involved in the model generation steps; this is problematic because metabolomics studies are often required to identify important variables (such as biomarkers) that characterize the differences between classes. The DNN approach therefore needed to be improved before it could be applied to metabolomics studies. In the present paper, we describe an improved DNN-based analytical approach that enables the estimation of the importance of each variable; this was accomplished by incorporating a mean decrease accuracy (MDA) calculation based on a permutation algorithm. The approach was named DNN-MDA. The DNN-MDA approach was then applied to a performance evaluation experiment in which it was compared against typical machine learning methods, namely PLS, RF, and SVM. The classification problem investigated was the geographical origin of yellowfin goby collected from various rivers throughout Japan.

Materials and methods

Data preparation

This study used two types of NMR spectral data sets derived from the muscle metabolites of yellowfin goby (Acanthogobius flavimanus). One was the set of NMR spectra of water-soluble components (n = 170) that we reported in a previous study;35 this data set was used only for the hyperparameter estimation of our DNN-MDA approach. The other was the set of NMR spectra of methanol-soluble components (n = 1022) reported in a previous study;36 the processing of the two-dimensional 1H J-resolved NMR spectra (n = 1022) in our study was performed as in that work. In brief, the NMR data were both automatically and manually phased, zero filled, and baseline corrected in Bruker TopSpin software (Bruker BioSpin GmbH, Rheinstetten, Germany). The NMR peaks were selected using rNMR software,37 and a total of 106 variables were obtained. The peak-picked data were processed by probabilistic quotient normalization with scaling and centering.

Construction of DNN-MDA

The DNN-MDA algorithm was developed on the R platform.38 The algorithm is composed mainly of three parts: data segmentation for the k-fold cross validation (CV), computation of the neural network using the mxnet library,39 and calculation of the variable importance (VI) based on a permutation algorithm (Fig. 1). For the data segmentation, a repeated k-fold CV (number of repeats = 20) was used in order to prevent over-fitting. Before the performance evaluation, the impact of the k value of the k-fold CV on the classification accuracy of the DNN-MDA models was analyzed (Fig. S1). From this analysis, we determined that the classification performance was hardly affected by this parameter, except when small values were used; k = 13 gave the best classification performance for this data set and was therefore used.


For the computation with the mxnet library, the hyperparameter values (e.g., the number of rounds, the learning rate, and the number of nodes in the hidden layers) were determined using the data set of water-soluble components (n = 170), and these values were then used for the analysis of the data set of methanol-soluble components (n = 1022) (see the Results and discussion section). For the VI calculation, the values of a variable were randomly rearranged among the samples (the permutation), and the rearranged data matrices were evaluated with the constructed DNN model. The discrimination accuracies obtained from the permutations were compared with the model accuracy by the following equation:

\mathrm{VI} = \frac{1}{m}\sum_{i=1}^{m}\left(\mathrm{ACC} - \mathrm{ACC}_{P,i}\right),    (1)

where ACC represents the discrimination accuracy of the constructed model (before the rearrangements), ACC_{P,i} represents the discrimination accuracy calculated from the i-th rearranged data matrix (i.e., the permutation), and m is the number of permutations per variable. In this study, the permutations were repeated 50 times (i.e., m = 50) for each variable. In Eq. (1), a relatively small VI value (i.e., a near-zero value) means that the constructed model was rarely influenced by the variable (i.e., the variable had low importance for the constructed model), whereas a relatively large VI value indicates that the constructed model was strongly affected by the variable (i.e., it had high importance for the constructed model). Based on this criterion, which is called MDA, we evaluated the importance of the variables in the constructed DNN model (i.e., the contribution factors for the model). In addition, a variant in which the difference is squared before the summation was also adopted in the MDA calculation, because the squared value may enhance the sensitivity of detection of pertinent biomarkers in some cases. The DNN-MDA algorithm developed in this study was deposited on our website (http://dmar.riken.jp/Rscripts/).

Data analysis

PLS, RF, and SVM were performed on the R platform using the pls,40 randomForest,41 and classyfire42 libraries, respectively. For the calculation of SVM-MDA, the e1071 package was also used. The number of components for the PLS models was set to 8-24 (average 11.7 ± 2.16); the exact number depended on the model. For the RF models, the parameters mtry and ntree were set to 13 and 500, respectively. For the SVM models, the optimal gamma and cost parameters were determined by the classyfire algorithm or by the 'tune.svm' function in the e1071 package. The PLS and RF models were validated by a repeated k-fold CV in the same way as the DNN-MDA approach; the SVM models were validated by the classyfire algorithm. Receiver operating characteristic (ROC) curves were depicted and the area under the curve (AUC) was calculated on the R platform using the ROCR package. Significant differences between two groups were assessed using Welch's t test with Bonferroni correction.
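The permutation step behind Eq. (1) can be sketched in a few lines of R. The function below is a minimal illustration rather than the deposited implementation: 'model_accuracy' is a hypothetical helper that returns the discrimination accuracy of an already trained model on a data matrix X with class labels y, and all object names are illustrative.

```r
## Minimal sketch of the permutation-based MDA calculation in Eq. (1).
## 'model_accuracy' is a hypothetical helper (not part of any package) that
## returns the discrimination accuracy of an already trained model on a data
## matrix X (samples x variables) with class labels y.
mda_importance <- function(X, y, model_accuracy, m = 50) {
  acc <- model_accuracy(X, y)              # ACC: accuracy on the unpermuted data
  vi  <- setNames(numeric(ncol(X)), colnames(X))
  for (j in seq_len(ncol(X))) {
    acc_perm <- numeric(m)
    for (i in seq_len(m)) {
      Xp <- X
      Xp[, j] <- sample(Xp[, j])           # permute one variable across samples
      acc_perm[i] <- model_accuracy(Xp, y) # ACC_P,i: accuracy after permutation
    }
    vi[j] <- mean(acc - acc_perm)          # VI = (1/m) * sum(ACC - ACC_P,i)
  }
  vi
}
```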


Results and discussion

Parameter optimization of the DNN-MDA algorithm

Before the performance of the DNN-MDA algorithm was evaluated, the hyperparameter values of the DNN were optimized using the data set of water-soluble metabolites derived from the muscles of the yellowfin gobies. All of the fish samples in the water-soluble data set (n = 170) were included in the methanol-soluble data set used for the performance evaluation (n = 1022). Using the water-soluble data set, we evaluated how varying the hyperparameter values affected the classification accuracy of DNN models built to discriminate geographical differences (the yellowfin gobies were collected from the Tsurumi and Tama rivers in Japan) (Figs. S2 and S3). The number of rounds barely affected the classification performance of the DNN algorithm, although large values were likely to induce high variability, possibly because of over-fitting (Fig. S2A). This result indicates that values between 30 and 200 rounds are preferable settings for the DNN models because of their low computational cost, and the number of rounds was therefore set to 30 in this study. The learning rate significantly affected the classification performance of the DNN algorithm, and values ranging from 0.05 to 0.1 were found to be the most suitable (Fig. S2B); the learning rate for the performance evaluation was therefore set to 0.07. The array batch size also affected the classification performance of the DNN algorithm, and values ranging from 20 to 100 were found to be the most suitable (Fig. S2C); this study therefore adopted an array batch size of 30 for this experiment. The effect of the number of nodes in the hidden layers on the classification performance of the DNN-MDA was also evaluated (Fig. S3). Our analysis showed that this barely affected the classification accuracy of the DNN models, except when the number of nodes was small (1-5); this study therefore used 200 nodes in each of the two hidden layers.

Comparison of the classification performance of the DNN-MDA algorithm with other classification approaches

In order to evaluate the classification performance of the DNN-MDA algorithm, 1022 samples of yellowfin goby collected from various rivers in Japan were used as a data set. To simplify the classification problem, a binary classification centered on the geographical origins of the fish (i.e., Kanto (n = 583) versus other regions (n = 439)) was used for the evaluation of the DNN-MDA and the other methods (PLS, RF, and SVM). A principal component analysis (PCA) was first performed on this data set in order to understand its characteristics (Fig. S4A). The PCA score plot indicated that there were few differences between fish obtained from Kanto and those obtained from other regions. A similar finding was observed in the analysis by PLS, although PLS was more likely than PCA to improve the clustering of the data (Fig. S4B). These results indicate that it was difficult to extract features and find differences between the two regions with PCA and the PLS-based method. This led us to conclude that alternative methods, such as machine learning approaches, were necessary for characterizing (i.e., discriminating) the geographical differences between the metabolic profiles of the samples in the data set.


The classification performances of the DNN-MDA, SVM, RF, and PLS methods were evaluated and compared (Fig. 2). The average classification accuracies were 57.3% for PLS, 95.8% for SVM, 95.0% for RF, and 97.8% for DNN-MDA. The approaches that relied solely on machine learning (i.e., SVM, RF, and DNN-MDA) made accurate classifications for this binary problem, whereas the PLS method was not able to discriminate between the two groups well enough. To verify the constructed classification models, ROC curves and AUCs were analyzed for the DNN-MDA, RF, and SVM models (Fig. 3), which indicated that reliable group separations were achieved by each model. The DNN-MDA algorithm performed best in terms of classification accuracy, which suggests that this approach is a useful and powerful tool for discriminating geographical differences in metabolomics data.

Important variables

Using the DNN-MDA algorithm, we calculated the VI of each variable's contribution to the constructed model and identified important variables, such as trimethylamine N-oxide (TMAO), inosinic acid (IMP), and glycine, as characteristic metabolites (i.e., biomarkers) that contributed to the discrimination of the geographical differences between fish caught in Kanto and those caught in other regions (Fig. 4). When the squared value was used, a variable derived from overlapping TMAO, histidine, and phosphatidylcholine (PC) signals was emphasized (Figs. S5A and S5B). In addition, we calculated the VI using an SVM-MDA algorithm, and the important variables identified by SVM-MDA were almost the same as those identified by DNN-MDA (Fig. S5C). This result indicates that MDA is a versatile approach for identifying important variables across various machine learning methods. In order to validate the calculated VI, a significance test was performed between the two groups for each of the identified important variables (Fig. 5). This test indicated that most of the metabolites identified as important variables were either significantly more or significantly less abundant in the muscles of the yellowfin gobies derived from the Kanto region. These significant differences indicate that the identified important variables were distinguishing metabolites characterizing the differences between the Kanto and other regions. Thus, the DNN-MDA algorithm successfully identified characteristic metabolites for the Kanto region and provided a highly accurate classification of geographical differences.

Limitations of the DNN-MDA approach

Although the DNN-MDA algorithm is a useful tool for the classification and regression of various subjects, such as the discrimination of the geographical origin of specimens from their metabolic profiles, it has limitations in some situations. One limitation is that the classification performance of the DNN-MDA method depends on sample size. In order to evaluate how sample size affects the classification accuracy of the DNN-MDA algorithm, we systematically decreased the number of samples in the data set to be analyzed (Fig. 6).


When a small sample size (e.g., 50 samples) was used for the computation of the DNN-MDA algorithm, the average classification accuracy decreased drastically, falling below 80%, and the variability of the accuracy was greater. The DNN-MDA approach therefore requires a relatively large number of samples (at least 200 in the data set we investigated) in order to achieve high classification performance (over 90%). In addition, we evaluated the relationship between the sample size and the identified important variables (Fig. S6). The identified important variables varied widely when fewer than 200 samples were used for the calculation. In contrast, the identified important variables were almost the same as those obtained with the full data set when 300 or 500 samples were used (Figs. S6D and S6E). Thus, a sufficient sample size is required to construct robust classification models and to identify the 'true' important variables with the DNN-MDA approach. In cases where only small sample sizes are available, the SVM algorithm may achieve higher classification performance than the DNN-MDA algorithm, because SVM has previously been reported to perform adequately for classification and biomarker identification with a small number of biological replicates in metabolomics studies.43 The DNN-MDA approach may perform worse with small sample sizes because of learning failure.

We also evaluated the classification performance of the DNN-MDA approach for a biased data set (i.e., an imbalanced sample distribution between the two groups) and compared its performance with that of the other methods (i.e., PLS, RF, and SVM). From the overall data set (n = 1022), we evaluated whether the methods could distinguish the metabolic features of yellowfin goby captured only in the Chugoku region (n = 47); i.e., the classification problem was to discriminate a limited number of targets from a set consisting of a large amount of non-relevant data. The metabolic profiles derived from the Chugoku region were visualized on a PCA score plot (Fig. S4C), which indicated that the features of the Chugoku region were obscured by the metabolic profiles derived from the other regions; PCA was therefore unable to determine the metabolic features of fish from the Chugoku region. The machine learning approaches were applied to this classification problem, and the resulting confusion matrices and average classification performance indicators (i.e., sensitivity, specificity, and accuracy) are listed in Table 1. With regard to classification accuracy, all of the methods showed high performance (over 95%), with the DNN-MDA algorithm performing best. However, the PLS model categorized all of the samples, including those from the Chugoku region, as being from non-Chugoku regions; as such, it was unable to capture any metabolic features of the Chugoku region. The best sensitivity was achieved by SVM, which correctly classified 97% of the samples from the Chugoku region; the DNN-MDA approach also performed well, with 90.2% sensitivity. These results indicate that the performance of SVM is slightly superior to that of the DNN-MDA algorithm for highly biased data sets. This limits the applicability of the DNN-MDA approach to certain classification problems, as other methods are more suitable in cases where only a small sample size or a highly biased data set is available.
Conclusion

We have constructed a DNN-MDA approach with which supervised classification and regression modeling can be performed and important variables can be determined for the evaluation of biological and environmental samples. To the best of our knowledge, this study is the first to introduce a DNN algorithm that provides indicators of variable importance in metabolomics studies. In the geographical discrimination of yellowfin goby from the Kanto region and other regions, the DNN-MDA approach showed the best classification performance among the various machine learning methods tested, whereas the conventional PLS method was unable to construct a useful discriminant model for this data set. Additionally, the impacts of sample size and of an imbalanced sample distribution between two classes on the classification performance of the DNN-MDA approach were evaluated, and limitations to the usefulness of this approach were found. This study therefore provides a guideline for choosing which machine learning method to use under which circumstances (i.e., when the characteristics of the data sets being investigated are taken into consideration). From this point of view, the DNN-MDA approach is likely to be suitable in cases where more than about 200 samples are available and the difference between the numbers of samples in the different classes is relatively small. The DNN-MDA approach is thus expected to be a useful and powerful tool for classification and regression modeling and for the identification of biomarkers not only in metabolomics but also in other omics-based studies, such as genomics and proteomics.

Acknowledgements

The authors wish to thank Feifei Wei and Kenji Sakata (RIKEN) for the data preparation and metabolite annotations used in this study.

Author Contributions

The manuscript was written by Y.D. and reviewed by J.K. All authors have approved the final version of the manuscript.

Supporting Information

Additional information cited in the text is available free of charge via the Internet at http://pubs.acs.org. The Supporting Information contains six figures, including the parameter optimization results, PCA, and validation of the important variables (PDF).

References


(1) LeCun, Y.; Bengio, Y.; Hinton, G. Nature 2015, 521, 436-444.
(2) Viant, M. R.; Bearden, D. W.; Bundy, J. G.; Burton, I. W.; Collette, T. W.; Ekman, D. R.; Ezernieks, V.; Karakach, T. K.; Lin, C. Y.; Rochfort, S.; De Ropp, J. S.; Teng, Q.; Tjeerdema, R. S.; Walter, J. A.; Wu, H. Environ. Sci. Technol. 2009, 43, 219-225.
(3) Ward, J. L.; Baker, J. M.; Miller, S. J.; Deborde, C.; Maucourt, M.; Biais, B.; Rolin, D.; Moing, A.; Moco, S.; Vervoort, J.; Lommen, A.; Schafer, H.; Humpfer, E.; Beale, M. H. Metabolomics 2010, 6, 263-273.
(4) Lee, S.; Wen, H.; An, Y. J.; Cha, J. W.; Ko, Y. J.; Hyberts, S. G.; Park, S. Anal. Chem. 2017, 89, 1078-1085.
(5) Wen, H.; An, Y. F.; Xu, W. J.; Kang, K. W.; Park, S. Angew. Chem. Int. Ed. 2015, 54, 5374-5377.
(6) Ryoo, I.; Kwon, H.; Kim, S. C.; Jung, S. C.; Yeom, J. A.; Shin, H. S.; Cho, H. R.; Yun, T. J.; Choi, S. H.; Sohn, C. H.; Park, S.; Kim, J. H. Sci. Rep. 2016, 6.
(7) Ogura, T.; Date, Y.; Masukujane, M.; Coetzee, T.; Akashi, K.; Kikuchi, J. Sci. Rep. 2016, 6.
(8) Asakura, T.; Sakata, K.; Yoshida, S.; Date, Y.; Kikuchi, J. PeerJ 2014, 2.
(9) Chikayama, E.; Sekiyama, Y.; Okamoto, M.; Nakanishi, Y.; Tsuboi, Y.; Akiyama, K.; Saito, K.; Shinozaki, K.; Kikuchi, J. Anal. Chem. 2010, 82, 1653-1658.
(10) Kikuchi, J.; Tsuboi, Y.; Komatsu, K.; Gomi, M.; Chikayama, E.; Date, Y. Anal. Chem. 2016, 88, 659-665.
(11) Cui, Q.; Lewis, I. A.; Hegeman, A. D.; Anderson, M. E.; Li, J.; Schulte, C. F.; Westler, W. M.; Eghbalnia, H. R.; Sussman, M. R.; Markley, J. L. Nat. Biotechnol. 2008, 26, 162-164.
(12) Wishart, D. S.; Jewison, T.; Guo, A. C.; Wilson, M.; Knox, C.; Liu, Y. F.; Djoumbou, Y.; Mandal, R.; Aziat, F.; Dong, E.; Bouatra, S.; Sinelnikov, I.; Arndt, D.; Xia, J. G.; Liu, P.; Yallou, F.; Bjorndahl, T.; Perez-Pineiro, R.; Eisner, R.; Allen, F.; Neveu, V.; Greiner, R.; Scalbert, A. Nucleic Acids Res. 2013, 41, D801-D807.
(13) Bingol, K.; Zhang, F. L.; Bruschweiler-Li, L.; Bruschweiler, R. Anal. Chem. 2012, 84, 9395-9401.
(14) Bingol, K.; Bruschweiler-Li, L.; Li, D. W.; Bruschweiler, R. Anal. Chem. 2014, 86, 5494-5501.
(15) Ludwig, C.; Easton, J. M.; Lodi, A.; Tiziani, S.; Manzoor, S. E.; Southam, A. D.; Byrne, J. J.; Bishop, L. M.; He, S.; Arvanitis, T. N.; Gunther, U. L.; Viant, M. R. Metabolomics 2012, 8, 8-18.
(16) Steinbeck, C.; Kuhn, S. Phytochemistry 2004, 65, 2711-2717.
(17) Hao, J.; Astle, W.; De Iorio, M.; Ebbels, T. M. D. Bioinformatics 2012, 28, 2088-2090.
(18) Xia, J. G.; Sinelnikov, I. V.; Han, B.; Wishart, D. S. Nucleic Acids Res. 2015, 43, W251-W257.
(19) Worley, B.; Powers, R. ACS Chem. Biol. 2014, 9, 1138-1144.
(20) Ito, K.; Tsutsumi, Y.; Date, Y.; Kikuchi, J. ACS Chem. Biol. 2016, 11, 1030-1038.
(21) Cloarec, O.; Dumas, M. E.; Craig, A.; Barton, R. H.; Trygg, J.; Hudson, J.; Blancher, C.; Gauguier, D.; Lindon, J. C.; Holmes, E.; Nicholson, J. Anal. Chem. 2005, 77, 1282-1289.
(22) Shiokawa, Y.; Misawa, T.; Date, Y.; Kikuchi, J. Anal. Chem. 2016, 88, 2714-2719.
(23) Misawa, T.; Komatsu, T.; Date, Y.; Kikuchi, J. Chem. Commun. 2016, 52, 2964-2967.
(24) Breiman, L. Mach. Learn. 2001, 45, 5-32.
(25) Vapnik, V. N. John Wiley & Sons, 1998.
(26) Gromski, P. S.; Muhamadali, H.; Ellis, D. I.; Xu, Y.; Correa, E.; Turner, M. L.; Goodacre, R. Anal. Chim. Acta 2015, 879, 10-23.
(27) Mahadevan, S.; Shah, S. L.; Marrie, T. J.; Slupsky, C. M. Anal. Chem. 2008, 80, 7562-7570.
(28) Hochrein, J.; Klein, M. S.; Zacharias, H. U.; Li, J.; Wijffels, G.; Schirra, H. J.; Spang, R.; Oefner, P. J.; Gronwald, W. J. Proteome Res. 2012, 11, 6242-6251.
(29) Kwon, Y. K.; Bong, Y. S.; Lee, K. S.; Hwang, G. S. Food Chem. 2014, 161, 168-175.
(30) Karu, N.; Wilson, R.; Hamede, R.; Jones, M.; Woods, G. M.; Hilder, E. F.; Shellie, R. A. J. Proteome Res. 2016, 15, 3827-3840.
(31) Bertini, I.; Luchinat, C.; Miniati, M.; Monti, S.; Tenori, L. Metabolomics 2014, 10, 302-311.
(32) Fathi, F.; Majari-Kasmaee, L.; Mani-Varnosfaderani, A.; Kyani, A.; Rostami-Nejad, M.; Sohrabzadeh, K.; Naderi, N.; Zali, M. R.; Rezaei-Tavirani, M.; Tafazzoli, M.; Arefi-Oskouie, A. Magn. Reson. Chem. 2014, 52, 370-376.
(33) Bertini, I.; Calabro, A.; De Carli, V.; Luchinat, C.; Nepi, S.; Porfirio, B.; Renzi, D.; Saccenti, E.; Tenori, L. J. Proteome Res. 2009, 8, 170-177.
(34) Cacciatore, S.; Luchinat, C.; Tenori, L. Proc. Natl. Acad. Sci. U.S.A. 2014, 111, 5117-5122.
(35) Yoshida, S.; Date, Y.; Akama, M.; Kikuchi, J. Sci. Rep. 2014, 4, 7005.
(36) Misawa, T.; Wei, F.; Kikuchi, J. Anal. Chem. 2016, 88, 6130-6134.
(37) Lewis, I. A.; Schommer, S. C.; Markley, J. L. Magn. Reson. Chem. 2009, 47, S123-S126.
(38) R Core Team. R Foundation for Statistical Computing, 2015. https://www.R-project.org/.
(39) Chen, T.; Li, M.; Li, Y.; Lin, M.; Wang, N.; Wang, M.; Xiao, T.; Xu, B.; Zhang, C.; Zhang, Z. arXiv preprint 2015, arXiv:1512.01274.
(40) Mevik, B. H.; Wehrens, R. J. Stat. Softw. 2007, 18, 1-23.
(41) Liaw, A.; Wiener, M. R News 2002, 2, 18-22.
(42) Chatzimichali, E. A.; Bessant, C. Metabolomics 2016, 12, 16.
(43) Heinemann, J.; Mazurie, A.; Tokmina-Lukaszewska, M.; Beilman, G. J.; Bothner, B. Metabolomics 2014, 10, 1121-1128.


Figures and Tables

Figure 1 Flowchart of the constructed DNN-MDA algorithm. The original data are repeatedly split (at least k times) into modeling and evaluation data; the network consists of a fully-connected layer (200 nodes), ReLU activation, a second fully-connected layer (200 nodes), ReLU activation, a fully-connected output layer (2 nodes), and a softmax layer. The model is evaluated with k-fold cross validation, permutations (50 per variable) are applied, the variable importance is calculated, and the accuracy and importance values are output.
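For orientation, the network in Figure 1 might be specified as follows with the R interface of the mxnet library used in this study. The mx.mlp() call and its argument names follow the standard mxnet R examples and are an assumption about that interface, not the deposited DNN-MDA code; the hyperparameter values are those reported in the Materials and methods section.

```r
## A sketch of the Figure 1 network using the mxnet R interface (assumed API).
library(mxnet)

## train_x: numeric matrix (samples x 106 variables); train_y: 0/1 class labels
fit_dnn <- function(train_x, train_y) {
  mx.set.seed(1)
  mx.mlp(
    data             = data.matrix(train_x),
    label            = train_y,
    hidden_node      = c(200, 200),   # two hidden layers with 200 nodes each
    out_node         = 2,             # binary classification output
    activation       = "relu",
    out_activation   = "softmax",
    num.round        = 30,            # number of rounds
    learning.rate    = 0.07,
    array.batch.size = 30,
    eval.metric      = mx.metric.accuracy
  )
}
```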


Figure 2 Classification accuracy of the DNN-MDA, PLS, SVM, and RF algorithms.
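As a point of reference for Figure 2, the comparison models could be set up roughly as sketched below with the R packages named under Data analysis (pls, randomForest, and e1071). The object names and the dummy-coded PLS response are illustrative, and the classyfire-based SVM workflow used in the study is replaced here by e1071's svm() for brevity.

```r
## Hedged sketch of the comparison models (PLS, RF, SVM); not the original scripts.
library(pls)
library(randomForest)
library(e1071)

compare_models <- function(train_x, train_y, test_x, ncomp = 12) {
  y_num <- as.numeric(factor(train_y)) - 1          # dummy-code the two classes as 0/1

  ## PLS-DA via a dummy response; ncomp was 8-24 depending on the model
  pls_df   <- data.frame(y_num = y_num, train_x)
  pls_fit  <- plsr(y_num ~ ., data = pls_df, ncomp = ncomp)
  pls_pred <- drop(predict(pls_fit, newdata = data.frame(test_x), ncomp = ncomp)) > 0.5

  ## Random forest with mtry = 13 and ntree = 500
  rf_fit  <- randomForest(x = train_x, y = factor(train_y), mtry = 13, ntree = 500)
  rf_pred <- predict(rf_fit, test_x)

  ## SVM with gamma and cost selected by tune.svm()
  tuned    <- tune.svm(x = train_x, y = factor(train_y),
                       gamma = 10^(-4:0), cost = 10^(0:3))
  svm_fit  <- svm(train_x, factor(train_y),
                  gamma = tuned$best.parameters$gamma,
                  cost  = tuned$best.parameters$cost)
  svm_pred <- predict(svm_fit, test_x)

  list(pls = pls_pred, rf = rf_pred, svm = svm_pred)
}
```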


Figure 3 ROC curves of the DNN-MDA, SVM, and RF algorithms.
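The ROC curves and AUC values in Figure 3 can be obtained with the ROCR package roughly as follows; 'scores' (predicted class probabilities or decision values) and 'labels' are assumed inputs, and the helper is a sketch rather than the exact analysis script.

```r
## Sketch of the ROC/AUC analysis with the ROCR package.
library(ROCR)

plot_roc_auc <- function(scores, labels) {
  pred <- prediction(scores, labels)
  perf <- performance(pred, measure = "tpr", x.measure = "fpr")
  plot(perf)                                          # ROC curve
  performance(pred, measure = "auc")@y.values[[1]]    # AUC value
}
```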


Figure 4 Identification of the important variables that contributed to the constructed model by the DNN-MDA algorithm. In the figure, FAs refers to fatty acids, IMP to inosinic acid, PC to phosphatidylcholine, PUFAs to polyunsaturated fatty acids, and TMAO to trimethylamine N-oxide.


Figure 5 Validation of the important variables identified by the DNN-MDA algorithm. A significance test (Welch's t test with Bonferroni correction) was performed between data from the Kanto region and data from the other regions. The abbreviations in this figure are as follows: FA, fatty acid; FC, fold change; IMP, inosinic acid; PC, phosphatidylcholine; PUFA, polyunsaturated fatty acid; TMAO, trimethylamine N-oxide. Note that the TMAO peak overlapped with those of histidine and PC.
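The validation in Figure 5 amounts to a per-variable Welch's t test followed by Bonferroni correction, which can be sketched in base R as follows; the matrix X (samples x variables) and the two-level group factor are assumed inputs.

```r
## Sketch of the per-variable significance test used to validate the important
## variables: Welch's t test (t.test() with unequal variances, the R default)
## followed by Bonferroni correction.
welch_bonferroni <- function(X, group) {
  p_raw <- apply(X, 2, function(v) t.test(v ~ group)$p.value)
  p.adjust(p_raw, method = "bonferroni")
}
```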


Figure 6 Relationship between the classification performance of the DNN-MDA algorithm and the number of samples used for the analyses.
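A minimal sketch of the sample-size evaluation behind Figure 6 is given below, assuming a hypothetical helper 'cv_accuracy' that wraps the repeated k-fold CV of the DNN-MDA workflow; the subset sizes and the unstratified subsampling are illustrative.

```r
## Sketch of the sample-size evaluation: the data set is repeatedly subsampled to
## each target size and the cross-validated accuracy is recorded. 'cv_accuracy'
## is a hypothetical helper and is not part of any package.
sample_size_curve <- function(X, y, cv_accuracy,
                              sizes = c(50, 100, 200, 300, 500, nrow(X)),
                              n_repeats = 20) {
  sapply(sizes, function(n) {
    replicate(n_repeats, {
      idx <- sample(nrow(X), n)
      cv_accuracy(X[idx, , drop = FALSE], y[idx])
    })
  })
}
```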


Table 1 Confusion matrices and classification performance for the discrimination of yellowfin goby from the Chugoku and non-Chugoku regions by the PLS, SVM, RF, and DNN-MDA methods. Rows give the observed class and columns the predicted class; values are averages over the repeated validations.

Method    Observed   Predicted Chugoku   Predicted Others   Sensitivity   Specificity   Accuracy
PLS       Chugoku    0                   47.0               0             1.000         0.954
          Others     0                   975.0
SVM       Chugoku    45.6                1.4                0.970         0.990         0.989
          Others     9.8                 965.3
RF        Chugoku    31.2                15.8               0.664         0.999         0.984
          Others     1.0                 974.0
DNN-MDA   Chugoku    42.4                4.6                0.902         0.997         0.993
          Others     2.9                 972.2


for TOC only
