
Article

A Computational Selection of Metabolite Biomarkers Using Emerging Pattern Mining. A Case Study in human Hepatocellular Carcinoma. Guillaume Poezevara, Sylvain Lozano, Bertrand Cuissart, Ronan Bureau, Pierre Bureau, Vincent Croixmarie, Philippe Vayer, and Alban Lepailleur J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 27 Apr 2017 Downloaded from http://pubs.acs.org on April 27, 2017


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

A Computational Selection of Metabolite Biomarkers Using Emerging Pattern Mining. A Case Study in human Hepatocellular Carcinoma.

Guillaume Poezevara a,b,c, Sylvain Lozano d, Bertrand Cuissart b, Ronan Bureau a, Pierre Bureau a, Vincent Croixmarie d, Philippe Vayer d and Alban Lepailleur a,*

a Centre d’Etudes et de Recherche sur le Médicament de Normandie, Normandie Univ, UNICAEN, CERMN, 14000 Caen, France

b Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen, Normandie Univ, UNICAEN, ENSICAEN, CNRS, GREYC, 14000 Caen, France

c QUIID SAS, 17 rue Claude Bloch, 14000 Caen

d Technologie Servier, 27 rue Eugène Vignat, 45000 Orléans, France

* Corresponding author. Phone: (33)2-31-56-68-22; fax: (33)2-31-56-68-03; e-mail: [email protected]

ACS Paragon Plus Environment


ABSTRACT

Biomarker development in metabolomics aims at discriminating diseased from normal subjects and at creating a predictive model that can be used to diagnose new subjects. Through a case study on human hepatocellular carcinoma (HCC), we studied for the first time the potential usefulness of emerging patterns (EPs), which come from the data mining domain. When applied to a metabolomics dataset labeled with two classes (e.g. HCC patients vs healthy subjects), EP mining can capture combinations of metabolites that differentiate the two classes. We observed that the so-called jumping emerging patterns (JEPs), which correspond to the combinations of metabolites that occur in only one of the two classes, achieved better performance than individual biomarkers. In particular, the implementation of the JEPs in a rule-based diagnostic tool drastically reduced the false positive rate, i.e. the rate of healthy subjects predicted as HCC patients.

KEYWORDS

Metabolomics, Biomarkers, Human Hepatocellular Carcinoma, HCC, Data Mining, Emerging Patterns.


INTRODUCTION

The last decade has seen an explosion in the amount of biological data with the advent of omics technologies. These technologies, which have emerged as essential methodological approaches for interpreting and understanding complex biological processes, use high-throughput screening techniques to simultaneously monitor thousands of molecular components. Omics span an increasingly wide range of fields, including genomics, transcriptomics, proteomics, and metabolomics. Even though its recognition as a distinct scientific area is much more recent than that of the other omics, metabolomics is now considered a promising field for the identification of key metabolic features that characterize certain pathological and physiological states.1,2 Indeed, alterations in metabolism resulting from functional responses to any given condition cause changes in the concentration of metabolites. These changes form indicators, or biomarkers, of normal biological states, pathological processes or pharmacological responses to a therapeutic intervention.3,4

Traditionally, molecular biomarkers that distinguish diseased subjects from normal subjects are widely used in clinical practice due to their ease of measurement. However, most currently applied biomarkers are single metabolites and many of them suffer from high false positive rates, which seriously limits their routine use. To overcome this limitation, tracking a set of metabolic biomarkers seems to be a more robust approach than independently following single metabolites in order to characterize the investigated disease.5

Fundamentally, the goal of biomarker development in metabolomics is to create a predictive model which can be used to classify new subjects into specific groups (e.g. diseased vs healthy subjects).6 One of the greatest challenges lies in identifying an optimal set of metabolites that will provide the maximal discriminating power between the two groups. Then, a mathematical equation or a computer algorithm has to be found to combine the set of selected metabolites with
the aim of accurately predicting the clinical outcome. This has led to a demand for practical and effective data mining methods for in silico analysis. These data mining methods range from univariate statistical testing to multivariate regression, the latter being of special importance since, as mentioned above, one biomarker alone is often not sufficiently specific for tracing a given state by itself. Multivariate methods include principal component analysis (PCA), partial least squares (PLS), orthogonal projections to latent structures (OPLS), cluster analysis, machine learning techniques and non-linear methods, e.g. support vector machine (SVM) and artificial neural networks (ANN).7–9

When one considers a metabolomics dataset whose objects are labeled with a class value (e.g. diseased vs. healthy subjects), it is invaluable to extract sets of metabolites whose occurrences are correlated with one class.10 In our framework, such a set is named a pattern. In data mining, the notion of an emerging pattern (EP)11 embodies our intention by using the growth-rate measure. When a metabolomics dataset is partitioned between diseased subjects and non-diseased ones, the growth-rate of a pattern corresponds to the ratio of its frequency in the diseased subjects to its frequency outside the diseased subjects. An EP denotes a pattern whose growth-rate exceeds a given minimum threshold, indicating a certain level of correlation with a class label. Thus, the concept of an EP makes it possible to capture the significant differences between two classes of subjects. As a biomarker holds a numerical value, one challenge in mining metabolomics data to compute EPs lies in the required discretization preprocessing step; for a given biomarker, this step aims at partitioning its range of concentration into a sequence of intervals.
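To make the growth-rate definition above concrete, here is a minimal Python sketch. It is only an illustration, not the implementation used in this study: the function names, the item labels such as "A:low" and the toy subjects are all hypothetical. A subject is modeled as a set of discretized items and a pattern as a set of items.

```python
def frequency(pattern, subjects):
    """Fraction of subjects whose profile contains every item of the pattern."""
    if not subjects:
        return 0.0
    return sum(pattern <= profile for profile in subjects) / len(subjects)


def growth_rate(pattern, diseased, healthy):
    """Frequency among diseased subjects divided by frequency among the others."""
    f_diseased = frequency(pattern, diseased)
    f_healthy = frequency(pattern, healthy)
    if f_healthy == 0:
        # Pattern absent from the healthy class: infinite growth-rate if it
        # occurs at all among the diseased subjects (a "jumping" pattern).
        return float("inf") if f_diseased > 0 else 0.0
    return f_diseased / f_healthy


def is_emerging(pattern, diseased, healthy, min_growth_rate):
    """An EP is a pattern whose growth-rate reaches the chosen threshold."""
    return growth_rate(pattern, diseased, healthy) >= min_growth_rate


# Hypothetical toy data: each subject is a set of "metabolite:interval" items.
diseased = [{"A:low", "B:low"}, {"A:low", "B:low"},
            {"A:low", "B:high"}, {"A:high", "B:low"}]
healthy = [{"A:low", "B:high"}, {"A:high", "B:high"},
           {"A:high", "B:low"}, {"A:high", "B:high"}]

print(growth_rate({"A:low"}, diseased, healthy))           # 0.75 / 0.25
print(growth_rate({"A:low", "B:low"}, diseased, healthy))  # absent from healthy
```

In this toy example the single item "A:low" has a finite growth-rate, whereas the combination of "A:low" and "B:low" never occurs in a healthy subject, which is the situation the jumping emerging patterns discussed later are meant to capture.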
Discretization methods fall into two distinct categories: the unsupervised methods, which do not use any information from the class label, and the supervised methods, which make use of the
given class label. It has been shown that supervised discretization is more beneficial to classification than unsupervised discretization.12 Our work applies a supervised discretization method that makes use of the Fisher test.

Through a case study on human hepatocellular carcinoma (HCC),13 our work investigated the following problems (see Figure 1): (i) the discretization of a range of concentrations of biomarkers into informative and reliable intervals, (ii) the selection of relevant biomarkers with respect to the HCC pathology, (iii) the computation of sets of biomarkers that occur predominantly in HCC patients, and (iv) the use of this knowledge for diagnosis support.

MATERIALS AND METHODS

Dataset. The dataset consists of the metabolic profiles of 153 subjects, which were collected from a paper published by Chen et al.13 The profiles result from the GC-TOFMS analysis of serum samples and provide the quantitative measurement of 317 metabolites. To simplify the writing in the following, the intensity of the mass spectrometry signal is referred to as “concentration”. A total of 82 patients suffering from human hepatocellular carcinoma (HCC) were considered in the study. Control samples were collected from 71 healthy volunteers using the same sample collection protocol. As a first step, 9 HCC patients and 8 healthy controls were randomly drawn and used as an external test set (about 10% of each group) to measure the performance of the final diagnostic tool. The remaining subjects constituted the training set (136 subjects in total, i.e. 73 HCC patients and 63 healthy controls) and were used in a five-fold cross-validation scheme.14 For the sake of completeness, it should be noted that we did not differentiate the patients with liver cirrhosis and hepatitis from the patients without liver cirrhosis and hepatitis.

Discretization protocol. As the concentration of a metabolite is a continuous measurement, it is impossible to carry out an inductive process without grouping similar concentrations into
intervals. We selected a hierarchical discretizer15 which starts by associating an interval to each concentration measure and then incrementally merges two consecutive intervals by relying on a statistical measure of dependency.16 Given a metabolite, when one considers grouping a range of concentration measurements into one interval, the p-value of a two-tailed Fisher test quantifies the deviation of the labels between the elements inside the investigated interval and the elements outside of it. As a discretizer partitions the range of concentration measurements into potentially several intervals, the following score (Equation 1) is used to assess a partition:

$$\mathrm{score}(P) = \sum_{I \in P} |I| \cdot p_I$$

Equation 1. Calculation of the score of a partition, where $R$ denotes the range of concentrations to discretize, $P$ denotes a partition of $R$ into intervals, $I$ any interval of $P$, $p_I$ the p-value of the two-tailed Fisher test related to $I$ in $R$, $N$ the total number of subjects, and $|I|$ the number of subjects in the interval $I$.

The score function is the sum of the p-values related to each interval of a partition. In order to take the population into account, the sum is weighted by the number of elements of each interval. As the p-value of an interval decreases when it groups a larger share of elements with the same label, the lower the score, the better the partition. Thus, the aim of the discretizer is to find a partition that minimizes the score. The supervised discretization protocol is illustrated in Figure 2, where a set of 10 observations is provided: subjects 1-5 suffer from HCC while subjects 6-10 represent healthy controls. In this example, the concentrations of a metabolite are reported on the second line of the table. They range from 0.1 to 0.6. The figure depicts every partition investigated by the discretizer and traces its execution on this instance: for each partition, it indicates its intervals, its composition of patients (a D stands for the HCC patients while an H denotes healthy subjects)
and its score. As the selected discretizer is a merging technique, it starts by creating an interval for each observed concentration: the first partition is $P^{1}$, with a score of 2.57. Then it iteratively passes through its improvement process. At each iteration, it computes the score related to any possible partition that can be obtained by merging two consecutive intervals of the current partition. Then, if the score of a newly envisaged partition is better than the score of the current partition, the discretizer replaces the current partition with the partition related to the best score. Otherwise, as none of the scores surpasses the score of the current partition, the discretizer returns the current partition and stops. In the example, as $P^{1}$ gathers five intervals, one can generate four different partitions from $P^{1}$ by merging two consecutive intervals. In Figure 2, these partitions are pictured on the second level, numbered from $P^{2}_{1}$ to $P^{2}_{4}$. As $P^{2}_{4}$ has the lowest score obtained among its sibling partitions and as it improves the score of $P^{1}$, $P^{2}_{4}$ replaces $P^{1}$ as the current partition. During the second iteration, three partitions originate from $P^{2}_{4}$ by merging two successive intervals: $P^{3}_{1}$, $P^{3}_{2}$ and $P^{3}_{3}$. Based on its score, $P^{3}_{1}$ is the best refinement of $P^{2}_{4}$ and it becomes the current partition. During the third iteration, two new partitions are compared with $P^{3}_{1}$ and the discretizer retains $P^{4}_{2}$. During the fourth iteration, the score of $P^{5}_{1}$ is higher than the score of $P^{4}_{2}$. As no improvement has been found at this level, the algorithm stops and returns the current partition, which corresponds to $P^{4}_{2}$.

If all the observations of two successive intervals belong to the same class, the discretizer always merges the two intervals, as any other merging discretizer. The specificity of a merging discretizer lies in its capability to retain intervals whose population does not completely belong to one class. In our example, the chosen discretizer, through its scoring function, has preferred $P^{4}_{2}$ over $P^{3}_{1}$.
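The merging procedure described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the score is assumed to be the sum over intervals of the interval size times its two-tailed Fisher p-value (one reading of Equation 1), the Fisher test is re-implemented with the standard library to keep the sketch self-contained, and the function names and the toy data (which are not the observations of Figure 2) are hypothetical.

```python
from math import comb


def fisher_two_tailed(a, b, c, d):
    """Two-tailed Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def score(partition, n_d, n_h):
    """Assumed score: interval p-values weighted by the interval sizes."""
    return sum((d + h) * fisher_two_tailed(d, h, n_d - d, n_h - h)
               for _, _, d, h in partition)


def discretize(values, labels):
    """Greedy bottom-up merging; returns intervals as (low, high, n_D, n_H)."""
    counts = {}
    for value, label in zip(values, labels):
        d, h = counts.get(value, (0, 0))
        counts[value] = (d + (label == "D"), h + (label == "H"))
    # Start with one interval per observed concentration.
    partition = [(v, v, d, h) for v, (d, h) in sorted(counts.items())]
    n_d, n_h = labels.count("D"), labels.count("H")
    current = score(partition, n_d, n_h)
    while len(partition) > 1:
        candidates = []
        for i in range(len(partition) - 1):
            (low, _, d1, h1), (_, high, d2, h2) = partition[i], partition[i + 1]
            merged = partition[:i] + [(low, high, d1 + d2, h1 + h2)] + partition[i + 2:]
            candidates.append((score(merged, n_d, n_h), i, merged))
        best_score, _, best = min(candidates)
        if best_score >= current:
            break  # no merge improves the score: keep the current partition
        partition, current = best, best_score
    return partition


# Hypothetical toy data: 5 diseased (D) and 5 healthy (H) subjects.
values = [0.1, 0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.6]
labels = list("DDDDDHHHHH")
print(discretize(values, labels))
```

On this perfectly separable toy input the procedure ends with two intervals, one per class; merging them into a single interval would raise the score, so the loop stops there.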


In our method, the discretizer is first applied to every metabolite. Then, the concentrations of each metabolite are rewritten based on the partition provided by the discretizer. As a particular case, if a metabolite range is discretized into a single interval, the metabolite is removed from the description of the subjects since it will not provide any contrasting information.

Feature selection. In order to focus on the most promising metabolites (and more precisely on the most promising of their intervals of concentration), we set up a feature selection process. For ranking metabolites we used Pearson’s Correlation Coefficient,17 which is also known as the Matthews Correlation Coefficient (MCC)18 when dealing with a two-class confusion matrix.19 The MCC is calculated from the confusion matrix using Equation 2:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.

[…] a FC > 1.0 in Chen et al. corresponds to an upper interval in our study, and a FC < 1.0 corresponds to a lower interval. Besides, in our study, only 25 out of these 36 metabolites were considered as relevant to significantly discriminate the HCC patients from the healthy subjects. These 25 common metabolites are
highlighted in italics in Table 1. Concerning the 11 additional metabolites reported by Chen et al., it should be noted that, had we relaxed our demanding minimum threshold of 0.5 for the MCC, most of them would also have been considered relevant in the following steps of the study, particularly G2 (2,2'-bipyridine), G30 (cysteine), G43 (fumaric acid), G62 (leucine), G80 (phenylalanine), and G95 (tryptophan), which exhibit an MCC and an accuracy of prediction greater than 0.4 and 0.7, respectively (see Supplementary Table S1). According to our analysis, however, these 11 additional metabolites do not retrieve any new HCC patients; on the contrary, they wrongly predict two additional healthy subjects as diseased.

The 65 potential biomarkers identified through the feature selection step (Table 1) were first considered in isolation for HCC prediction. The rationale for the predictions was as follows: if one or more metabolites relevant for HCC were detected in the metabolic profile of a subject, then the subject was classified as an HCC patient with a confidence related to the MCC value of its most relevant metabolite. First, we observed that all 73 HCC patients of the training set belong to the lower range of concentrations of G73 (oleamide), G78 (ornithine), G88 (stearic acid), G250, and G280. However, none of these metabolites is fully specific to the HCC patients since each of them is also deregulated in at least 4 healthy subjects. The only metabolite which displays a concentration range never observed in the healthy subjects is G287, but it only covers 60 out of 73 HCC patients. When the other relevant metabolites were considered as single biomarkers, the accuracy of the HCC prediction rapidly decreased since we retrieved fewer HCC patients and/or wrongly predicted more healthy subjects as diseased (Table 1). In conclusion, no single metabolite is able to perfectly predict all the HCC patients without misclassifying some healthy subjects.
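As a worked illustration of the performance measures discussed above, the following Python sketch computes the MCC and the accuracy of a single-interval biomarker from its support counts. It assumes that the HCC and control supports listed in Table 1 can be read as true positives and false positives on the 136 training subjects (73 HCC patients, 63 healthy controls); the function names are hypothetical.

```python
from math import sqrt

N_HCC, N_HEALTHY = 73, 63  # training set: 82 - 9 HCC patients, 71 - 8 controls


def confusion(support_hcc, support_control):
    """Confusion matrix when 'falls in the interval' means 'predicted HCC'."""
    tp, fp = support_hcc, support_control
    fn, tn = N_HCC - tp, N_HEALTHY - fp
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient of a two-class confusion matrix."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def accuracy(tp, fp, fn, tn):
    """Fraction of correctly classified subjects."""
    return (tp + tn) / (tp + fp + fn + tn)


# Support counts (HCC, control) taken from Table 1 for two metabolites.
for name, hcc, control in [("G73", 73, 4), ("G287", 60, 0)]:
    cm = confusion(hcc, control)
    print(name, round(mcc(*cm), 3), round(accuracy(*cm), 3))
```

Under this reading, G73 combines full coverage of the HCC patients with 4 false positives, while G287 trades 13 missed patients for a complete absence of false positives, which is the tension that motivates combining metabolites.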
In other words, a significant difference in the average levels of a single
metabolite between two groups of subjects does not necessarily mean the metabolite will be a good single biomarker for diagnosis.

Emerging Patterns and identification of complex metabolic signatures. The use of individual biomarkers for diagnostic applications generally leads to high false positive or high false negative rates, for reasons such as genetic variations or environmental factors. It is widely recognized that a set of biomarkers is better suited to distinguish diseased patients from healthy subjects and to achieve better performance. In this context, we apply the notion of emerging pattern (EP) mining, which comes from the data mining domain. This notion meets the objective of discovering sets of biomarkers since EP mining corresponds to the search for combinations of features that capture useful contrasting information between two classes without being biased by the scientist’s expectations. In this study we decided to generate the EPs from the 65 metabolites we identified as relevant biomarkers for HCC during the previous step. In order to prevent the problematic prediction of false positives (FP, healthy subjects predicted as diseased patients), which is the main issue encountered with biomarkers applied in isolation, we focused on the analysis of the jumping emerging patterns (JEPs), i.e. the patterns corresponding to the combinations of metabolites that occur in only one of the two classes. As reported in Table 2, no JEP was generated when we only considered the 4 metabolites with an MCC greater than 0.9, i.e. G73 (oleamide), G247, G250, and G280. This means that no combination was able to correct the FP previously predicted by separately considering these metabolites. From the 7 metabolites with an MCC greater than 0.8, four JEPs were obtained.
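The jumping property defined above can be illustrated with a brute-force Python sketch. This is not the mining algorithm used in this study (the enumeration below is exponential in the number of items and only suitable for toy inputs); the function names, the item labels and the toy subjects are hypothetical.

```python
from itertools import combinations


def support(pattern, subjects):
    """Number of subjects whose profile contains every item of the pattern."""
    return sum(pattern <= profile for profile in subjects)


def mine_jeps(diseased, healthy, max_size):
    """Enumerate patterns seen in at least one diseased and no healthy subject."""
    items = sorted(set().union(*diseased))
    jeps = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            if support(pattern, diseased) > 0 and support(pattern, healthy) == 0:
                jeps.append(pattern)
    return jeps


# Hypothetical toy data: each subject is a set of "metabolite:interval" items.
diseased = [{"m1:low", "m2:low", "m3:low"},
            {"m1:low", "m2:low", "m3:high"},
            {"m1:low", "m2:high", "m3:low"}]
healthy = [{"m1:low", "m2:high", "m3:high"},
           {"m1:high", "m2:low", "m3:high"},
           {"m1:high", "m2:high", "m3:low"}]

print(mine_jeps(diseased, healthy, max_size=1))  # no single item is a JEP
print(mine_jeps(diseased, healthy, max_size=2))  # pairs of items can be
```

In this toy example every single item also occurs in some healthy subject, so no one-item JEP exists, while several two-item combinations occur only among the diseased subjects. This mirrors, on a small scale, the observation that combinations can remove false positives that individual metabolites cannot.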
However, the predictive performance is no better than that obtained by considering G287 as a single biomarker, since the 13 uncovered diseased subjects of the training set remain (Table 2), thus corresponding to false negatives (FN, diseased patients predicted as healthy subjects). When these 4 metabolic signatures are applied on the external test
set for diagnosis of HCC, 3 out of the 9 diseased subjects are likewise not detected. By also including the 3 other metabolites with an MCC greater than 0.7 during the EP mining process, we obtained 25 JEPs, which decrease the number of misclassifications to 2 FN for the training set and one FN for the external test set. From the 36 metabolites with an MCC greater than 0.6, we generated 4398 JEPs. Although we no longer predict any FN for the training set, one still persists for the external test set. Finally, considering all 65 relevant metabolites, which leads to 12929 JEPs, did not yield a better metabolic signature. The remaining misclassified diseased subject (patient ID 124) escapes detection because its G22 (arabinose), G78 (ornithine), and G287 concentrations are not in the ranges prevailing in the HCC patients of the training set. Concerning the healthy subjects of the training set, none of them can be predicted as diseased, i.e. as a FP, since the JEPs are associated with only one class (here, the HCC patients). More interestingly, all the healthy subjects of the external test set are also correctly predicted, denoting a good classification accuracy of our metabolic signatures on unseen data.

In predictive biomarker studies, the performance of metabolic signatures can be determined by comparing the predicted outcome to the true outcome for an external test set randomly drawn from the initial population.6 As the receiver operating characteristic (ROC) curve is widely considered to be a statistically valid method for biomarker performance evaluation,33 we calculated the true positive rate and the true negative rate for different MCC thresholds (Table 2), assuming that subjects that exhibit a JEP in their profile are predicted diseased and subjects that lack all the JEPs are predicted healthy. The corresponding ROC curve, whose area under the curve (AUC) is 0.889, is shown in Figure 4. Note the peculiar shape of the curve due to the absence of FP.
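The prediction rule and the rates described above can be sketched in Python as follows. This is an illustrative sketch only: the function names and the toy JEPs are hypothetical, and the false negative counts are simply those stated in the text for the external test set (9 HCC patients, 8 healthy controls, no false positive), not a recomputation of Table 2.

```python
def predict_diseased(profile, jeps):
    """A subject is predicted diseased if its profile contains at least one JEP."""
    return any(jep <= profile for jep in jeps)


def tpr_tnr(tp, fn, tn, fp):
    """True positive rate (sensitivity) and true negative rate (specificity)."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy JEPs and profiles, only to show the rule at work.
jeps = [frozenset({"m1:low", "m2:low"}), frozenset({"m1:low", "m3:low"})]
print(predict_diseased({"m1:low", "m2:low", "m3:high"}, jeps))   # contains a JEP
print(predict_diseased({"m1:low", "m2:high", "m3:high"}, jeps))  # contains none

# False negatives on the external test set, as reported in the text,
# keyed by the minimum MCC used to select the metabolites.
reported_fn = {0.8: 3, 0.7: 1, 0.6: 1, 0.5: 1}
N_TEST_HCC, N_TEST_HEALTHY = 9, 8
for threshold, fn in reported_fn.items():
    tpr, tnr = tpr_tnr(N_TEST_HCC - fn, fn, N_TEST_HEALTHY, 0)
    print(threshold, round(tpr, 3), round(tnr, 3))
```

Because a JEP never occurs in a healthy training subject, the rule can only err on the training set by missing diseased subjects, which is why the true negative rate stays at 1 for every threshold while the true positive rate varies.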
Considered individually, the 45 JEPs providing the maximal discriminating power between the diseased and healthy subjects are shown in Table 3. All of them display an MCC value greater than 0.9 on the
training dataset. The most promising metabolic signature for HCC diagnosis corresponds to the combination of G73 (oleamide), G78 (ornithine), G88 (stearic acid), G247, G250, and G280. This JEP is present in 72 out of the 73 HCC patients of the training set and in all but one HCC patient of the external test set (patient ID 124). It should be noted that all 12929 JEPs share a basic set of 5 metabolites: G73 (oleamide), G78 (ornithine), G88 (stearic acid), G250, and G280. This combination was not obtained on its own because one healthy subject of the training set is covered by this metabolic signature, thus corresponding to a FP. Even though this combination is not a JEP, its discriminating performance between HCC patients and healthy controls is very high and it holds true on the external test set (no misclassification). All the metabolites that complete this basic set of 5 metabolites were added to fulfill the jumping property of the JEPs. By contrast, the 65 intervals of concentrations associated with the healthy subjects, i.e. those with an MCC lower than −0.5, only led to 10 JEPs (data not shown). All of them are based on the combination of the upper intervals of concentrations of G247 and G287. This result seems to underline a more stable metabolic profile among the healthy subjects in comparison with the HCC patients.

Comparison with state-of-the-art classification techniques. State-of-the-art classifiers can be used in metabolomics data analysis to discriminate diseased patients from healthy subjects. After data normalization, support vector machine (SVM), artificial neural network (ANN), and random forest (RF) classifiers were applied to the dataset, using the same 90:10 split between the training and external test sets. Our approach performs comparably to these state-of-the-art algorithms on the external test set (Table 4).
All the methods misclassify just one diseased subject (patient ID 124), except for the SVM, which makes no mistake. Unlike the three machine
learning methods, our pattern-based process has the decisive advantage of providing association rules that explain its decisions.

Emerging Patterns and metabolic pathways. The study of so-called network biomarkers has attracted attention in recent years because biological pathways are considered more reliable for characterizing diseases than individual molecules.34–36 Since the emerging patterns (EPs) correspond to conjunctions of metabolites, we hypothesized that some EPs could associate metabolites involved in the same metabolic pathways. Fatty acid metabolism is a leading example of the link between EPs and metabolic pathways. Some combinations of fatty acids involving arachidonic acid (G23), docosahexaenoic acid (G40), nervonic acid (G71), and stearic acid (G88) are shown in Table 5 along with their growth-rate, i.e. the ratio of the frequency of the pattern in the diseased subjects to its frequency in the healthy subjects. Glycerol (G46), which is combined with fatty acids via ester linkages to form triglycerides, is sometimes associated with these EPs. The urea cycle is another illustration of the retrieval of metabolic pathways, as exemplified by the EP grouping aspartic acid (G25), citrulline (G28), and ornithine (G78). Although encouraging, the calculated growth-rates of these deregulated combinations are below expectations. This could be explained by the limitation of the EP mining process to the 65 metabolites resulting from the feature selection step. In future work, we propose to focus on the interplay of related metabolites, even the less significant ones, by incorporating knowledge on metabolic pathways.

CONCLUSION

For the first time, we applied emerging pattern (EP) mining to identify sets of metabolite biomarkers that are useful to discriminate diseased from healthy subjects. Since EP mining is not suitable for continuous values, we also provided a method to discretize the ranges of
concentrations of the metabolites and to select meaningful features among them. We observed that combinations of biomarkers achieve better performance than individual biomarkers. In particular, the false positive rate, i.e. the rate of healthy subjects predicted as HCC patients, is drastically reduced with the so-called jumping emerging patterns (JEPs), opening the way to their potential use in clinical diagnostic applications. In future work, we propose to include metabolites which are not significant in isolation but which can produce a reliable discrimination when they are combined. Furthermore, we intend to incorporate knowledge on metabolic pathways to help reveal key underlying biological processes and to discover network biomarkers.

ASSOCIATED CONTENT AVAILABLE Supporting information Supplementary Table S1. Summary of the 36 differentially expressed serum metabolites in HCC patients relative to healthy subjects according to Chen et al.13 and resulting from GC-TOFMS analysis of serum samples.


TABLES

Table 1. Summary of the differentially expressed serum metabolites in HCC patients relative to healthy subjects.

| # | # | Concentration interval | Support (total) | Support (HCC) | Support (control) | MCC | Accuracy | p-value |
|---|---|---|---|---|---|---|---|---|
| G73 | #145 | [5:8[ | 77 | 73 | 4 | 0.942 | 0.971 | 3.28E-34 |
| G250 | #503 | [13:15[ | 77 | 73 | 4 | 0.942 | 0.971 | 3.28E-34 |
| G247 | #497 | [11:13[ | 75 | 72 | 3 | 0.941 | 0.971 | 9.98E-34 |
| G280 | #561 | [2:8[ | 78 | 73 | 5 | 0.928 | 0.963 | 5.11E-33 |
| G88 | #173 | [12:15[ | 81 | 73 | 8 | 0.887 | 0.941 | 7.79E-30 |
| G254 | #511 | [9:27[ | 74 | 68 | 6 | 0.837 | 0.919 | 2.94E-25 |
| G287 | #577 | [12:14[ | 60 | 60 | 0 | 0.825 | 0.904 | 3.88E-26 |
| G22 | #42 | [2:49[ | 79 | 68 | 11 | 0.765 | 0.882 | 1.04E-20 |
| G31 | #63 | [124:212] | 87 | 71 | 16 | 0.746 | 0.868 | 3.41E-20 |
| G320 | #646 | [5:282[ | 74 | 64 | 10 | 0.719 | 0.860 | 3.87E-18 |
| G277 | #555 | [106:518[ | 76 | 64 | 12 | 0.689 | 0.846 | 1.19E-16 |
| G172 | #348 | [79:726[ | 63 | 57 | 6 | 0.686 | 0.838 | 1.09E-16 |
| G321 | #648 | [1:201[ | 83 | 67 | 16 | 0.679 | 0.838 | 3.20E-16 |
| G125 | #253 | [20317:1.43185e+06[ | 71 | 61 | 10 | 0.676 | 0.838 | 6.26E-16 |
| G68 | #136 | [1:154[ | 79 | 65 | 14 | 0.675 | 0.838 | 7.75E-16 |
| G19 | #34 | [1:271[ | 86 | 68 | 18 | 0.668 | 0.831 | 8.21E-16 |
| G276 | #553 | [1:14[ | 86 | 68 | 18 | 0.668 | 0.831 | 8.21E-16 |
| G162 | #326 | [3.80942e+06:5.46004e+10[ | 84 | 67 | 17 | 0.665 | 0.831 | 1.38E-15 |
| G273 | #549 | [1291:2543] | 70 | 60 | 10 | 0.662 | 0.831 | 3.09E-15 |
| G78 | #156 | [12:16[ | 99 | 73 | 26 | 0.658 | 0.809 | 1.25E-16 |
| G116 | #235 | [9:751[ | 89 | 69 | 20 | 0.658 | 0.824 | 2.31E-15 |
| G293 | #590 | [4265:4821] | 69 | 59 | 10 | 0.648 | 0.824 | 1.45E-14 |
| G7 | #10 | [57:88[ | 64 | 56 | 8 | 0.640 | 0.816 | 1.77E-14 |
| G305 | #616 | [5:11] | 93 | 70 | 23 | 0.637 | 0.809 | 1.20E-14 |
| G146 | #296 | [5:256[ | 72 | 60 | 12 | 0.631 | 0.816 | 6.66E-14 |
| G160 | #322 | [284:3.49623e+08[ | 78 | 63 | 15 | 0.630 | 0.816 | 1.14E-13 |
| G165 | #334 | [1:1.74612e+09[ | 78 | 63 | 15 | 0.630 | 0.816 | 1.14E-13 |
| G275 | #551 | [1:253[ | 61 | 54 | 7 | 0.630 | 0.809 | 4.51E-14 |
| G262 | #527 | [91:238[ | 89 | 68 | 21 | 0.627 | 0.809 | 7.86E-14 |
| G29 | #58 | [526:3935[ | 71 | 59 | 12 | 0.617 | 0.809 | 2.84E-13 |
| G269 | #540 | [1131:2991[ | 77 | 62 | 15 | 0.615 | 0.809 | 2.69E-13 |
| G171 | #346 | [1:6.23279e+10[ | 88 | 67 | 21 | 0.610 | 0.801 | 3.18E-13 |
| G16 | #28 | [9:36[ | 72 | 59 | 13 | 0.601 | 0.801 | 1.16E-12 |
| G129 | #263 | [24566:250332[ | 78 | 62 | 16 | 0.600 | 0.801 | 1.06E-12 |
| G244 | #491 | [1:357[ | 78 | 62 | 16 | 0.600 | 0.801 | 1.06E-12 |
| G303 | #611 | [1:13[ | 74 | 60 | 14 | 0.600 | 0.801 | 1.16E-12 |
| G289 | #581 | [39:505[ | 58 | 51 | 7 | 0.592 | 0.787 | 2.09E-12 |
| G283 | #567 | [381:543[ | 85 | 65 | 20 | 0.590 | 0.794 | 2.61E-12 |

Metabolite names given in the table, in order of appearance: Oleamide; Stearic acid; Arabinose; Cystine; Methionine; 4-ketoglucose; Ornithine; 2,3-Dihydroxyl-propanoic acid; Creatinine; 3-Amino-2-piperidone.
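The MCC and accuracy columns of Table 1 appear to follow directly from the support counts if each interval is read as a one-feature classifier (a subject whose concentration falls in the interval is predicted as HCC). The following minimal Python sketch reproduces the first row under that reading. The class sizes of 73 HCC patients and 63 healthy subjects are an assumption inferred from the training-set counts in Table 2, and the function names are illustrative only.

```python
import math

# Assumed training-set class sizes, inferred from the Table 2 training counts.
N_HCC = 73
N_CONTROL = 63


def confusion_from_support(hcc_support, control_support,
                           n_hcc=N_HCC, n_control=N_CONTROL):
    """Read 'concentration falls in the interval' as a positive (HCC) call."""
    tp = hcc_support                  # HCC patients inside the interval
    fp = control_support              # healthy subjects inside the interval
    fn = n_hcc - hcc_support          # HCC patients outside the interval
    tn = n_control - control_support  # healthy subjects outside the interval
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


# First row of Table 1: G73, interval [5:8[, support 73 (HCC) and 4 (control).
g73 = confusion_from_support(73, 4)
print(round(mcc(*g73), 3), round(accuracy(*g73), 3))
```

Under these assumptions the sketch gives an MCC of about 0.942 and an accuracy of about 0.971 for G73, in agreement with the tabulated values.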


Table 1 (continued). Summary of the differentially expressed serum metabolites in HCC patients relative to healthy subjects.

| # | # | Concentration interval | Support (total) | Support (HCC) | Support (control) | MCC | Accuracy | p-value |
|---|---|---|---|---|---|---|---|---|
| G322 | #650 | [1:7[ | 95 | 69 | 26 | 0.579 | 0.779 | 4.20E-12 |
| G14 | #25 | [44:71[ | 59 | 51 | 8 | 0.575 | 0.779 | 9.77E-12 |
| G281 | #563 | [5:20[ | 59 | 51 | 8 | 0.575 | 0.779 | 9.77E-12 |
| G28 | #56 | [7:18[ | 84 | 64 | 20 | 0.574 | 0.787 | 1.25E-11 |
| G44 | #82 | [1:10[ | 84 | 64 | 20 | 0.574 | 0.787 | 1.25E-11 |
| G46 | #86 | [5897:9456[ | 72 | 58 | 14 | 0.572 | 0.787 | 1.70E-11 |
| G316 | #636 | [13:371[ | 82 | 63 | 19 | 0.572 | 0.787 | 1.72E-11 |
| G27 | #54 | [145:1072[ | 74 | 59 | 15 | 0.571 | 0.787 | 1.69E-11 |
| G23 | #44 | [1232:2469[ | 87 | 65 | 22 | 0.562 | 0.779 | 2.92E-11 |
| G61 | #123 | [43130:62350] | 60 | 51 | 9 | 0.558 | 0.772 | 4.25E-11 |
| G40 | #76 | [983:1737[ | 77 | 60 | 17 | 0.555 | 0.779 | 5.79E-11 |
| G71 | #142 | [242:745[ | 61 | 51 | 10 | 0.541 | 0.765 | 1.73E-10 |
| G216 | #437 | [5:25[ | 46 | 42 | 4 | 0.539 | 0.743 | 5.78E-11 |
| G85 | #170 | [28:44] | 48 | 43 | 5 | 0.532 | 0.743 | 2.16E-10 |
| G81 | #162 | [11886:18371[ | 79 | 60 | 19 | 0.526 | 0.765 | 6.24E-10 |
| G137 | #279 | [31:586[ | 75 | 58 | 17 | 0.526 | 0.765 | 6.80E-10 |
| G240 | #483 | [1:60[ | 81 | 61 | 20 | 0.526 | 0.765 | 9.08E-10 |
| G243 | #489 | [3:12[ | 75 | 58 | 17 | 0.526 | 0.765 | 6.80E-10 |
| G3 | #5 | [15:23] | 47 | 42 | 5 | 0.520 | 0.735 | 6.02E-10 |
| G55 | #108 | [59:1272[ | 59 | 49 | 10 | 0.516 | 0.750 | 1.49E-09 |
| G57 | #112 | [176:5937[ | 95 | 67 | 28 | 0.514 | 0.750 | 1.41E-09 |
| G145 | #294 | [41:316[ | 54 | 46 | 8 | 0.513 | 0.743 | 1.10E-09 |
| G65 | #130 | [15:8771[ | 82 | 61 | 21 | 0.512 | 0.757 | 2.62E-09 |
| G233 | #469 | [6542:31005[ | 82 | 61 | 21 | 0.512 | 0.757 | 2.62E-09 |
| G142 | #289 | [9:59[ | 56 | 47 | 9 | 0.508 | 0.743 | 1.72E-09 |
| G319 | #644 | [16:47[ | 98 | 68 | 30 | 0.506 | 0.743 | 2.10E-09 |
| G25 | #48 | [20:33[ | 89 | 64 | 25 | 0.503 | 0.750 | 3.37E-09 |

Metabolite names given in the table, in order of appearance: 2-Piperidine carboxylic acid; Citrulline; Glucosamine; Glycerol; Citric acid; Arachidonic acid; Lactic acid; Docosahexaenoic acid; Nervonic acid; Pyruvic acid; Phosphoric acid; 2,3-Dihydroxy-2(3H)-furanone; Inositol-3-phosphate; Lysine; Aspartic acid.
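Table 2 groups the features by a minimum MCC. As a rough check, the sketch below counts how many of the 65 MCC values listed in Table 1 and its continuation reach each threshold. The inclusive comparison (`>=`) is an assumption, chosen because it agrees with the numbers of metabolites reported in Table 2; the helper name is illustrative.

```python
# MCC values transcribed from Table 1 (38 features) and its continuation (27).
MCC_VALUES = [
    0.942, 0.942, 0.941, 0.928, 0.887, 0.837, 0.825, 0.765, 0.746, 0.719,
    0.689, 0.686, 0.679, 0.676, 0.675, 0.668, 0.668, 0.665, 0.662, 0.658,
    0.658, 0.648, 0.640, 0.637, 0.631, 0.630, 0.630, 0.630, 0.627, 0.617,
    0.615, 0.610, 0.601, 0.600, 0.600, 0.600, 0.592, 0.590,
    0.579, 0.575, 0.575, 0.574, 0.574, 0.572, 0.572, 0.571, 0.562, 0.558,
    0.555, 0.541, 0.539, 0.532, 0.526, 0.526, 0.526, 0.526, 0.520, 0.516,
    0.514, 0.513, 0.512, 0.512, 0.508, 0.506, 0.503,
]

THRESHOLDS = (0.9, 0.8, 0.7, 0.6, 0.5)


def count_at_or_above(values, threshold):
    """Number of features whose MCC is at least `threshold` (inclusive)."""
    return sum(1 for value in values if value >= threshold)


counts = {t: count_at_or_above(MCC_VALUES, t) for t in THRESHOLDS}
print(counts)
```

With the inclusive comparison, the counts are 4, 7, 10, 36, and 65 for thresholds 0.9 through 0.5.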


Table 2. Classification performances using the jumping emerging patterns (JEPs) as complex metabolic signatures.

| Metabolite MCC threshold | Number of metabolites | Number of JEPs | Set | TP | FP | TN | FN | Accuracy |
|---|---|---|---|---|---|---|---|---|
| 0.9 | 4 | 0 | Training | | | | | |
| 0.9 | 4 | 0 | Testing | | | | | |
| 0.8 | 7 | 4 | Training | 60 | 0 | 63 | 13 | 0.904 |
| 0.8 | 7 | 4 | Testing | 6 | 0 | 8 | 3 | 0.824 |
| 0.7 | 10 | 25 | Training | 71 | 0 | 63 | 2 | 0.985 |
| 0.7 | 10 | 25 | Testing | 8 | 0 | 8 | 1 | 0.941 |
| 0.6 | 36 | 4398 | Training | 73 | 0 | 63 | 0 | 1 |
| 0.6 | 36 | 4398 | Testing | 8 | 0 | 8 | 1 | 0.941 |
| 0.5 | 65 | 12929 | Training | 73 | 0 | 63 | 0 | 1 |
| 0.5 | 65 | 12929 | Testing | 8 | 0 | 8 | 1 | 0.941 |

a Classification columns. TP: number of true positives (actual diseased subjects predicted as diseased subjects); TN: number of true negatives (actual healthy subjects predicted as healthy subjects); FP: number of false positives (healthy subjects predicted as diseased subjects); FN: number of false negatives (diseased subjects predicted as healthy subjects).
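The accuracy column of Table 2 and the false positive rate follow from the four counts defined in the footnote. A small Python sketch using the Table 2 counts is given below; the class and property names are illustrative and are not taken from the study's software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # diseased subjects predicted as diseased
    fp: int  # healthy subjects predicted as diseased
    tn: int  # healthy subjects predicted as healthy
    fn: int  # diseased subjects predicted as healthy

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    @property
    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)


# Counts transcribed from Table 2, keyed by (MCC threshold, set).
TABLE_2 = {
    (0.8, "training"): Confusion(tp=60, fp=0, tn=63, fn=13),
    (0.8, "testing"): Confusion(tp=6, fp=0, tn=8, fn=3),
    (0.7, "training"): Confusion(tp=71, fp=0, tn=63, fn=2),
    (0.7, "testing"): Confusion(tp=8, fp=0, tn=8, fn=1),
    (0.6, "training"): Confusion(tp=73, fp=0, tn=63, fn=0),
    (0.6, "testing"): Confusion(tp=8, fp=0, tn=8, fn=1),
}

for (threshold, subset), cm in TABLE_2.items():
    print(threshold, subset, round(cm.accuracy, 3),
          cm.false_positive_rate, round(cm.sensitivity, 3))
```

Every row has FP = 0, so the false positive rate is zero at each threshold, while sensitivity rises as more metabolites are admitted.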


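Table 3. The 45 JEPs representing the most promising metabolic signatures for HCC diagnosis. Classification counts are given as training / testing.

| # | MET1 a | MET2 | MET3 | MET4 | MET5 | MET6 | MET7 | MET8 | TP | FP | TN | FN | MCC b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | G73 | G78 | G88 | G247 | G250 | G280 | | | 72 / 8 | 0 / 0 | 63 / 63 | 1 / 1 | 0.985 |
| 2 | G31 | G73 | G78 | G88 | G250 | G280 | | | 71 / 7 | 0 / 0 | 63 / 63 | 2 / 2 | 0.971 |
| 3 | G31 | G73 | G78 | G88 | G247 | G250 | G280 | | 70 / 7 | 0 / 0 | 63 / 63 | 3 / 2 | 0.957 |
| 4 | G73 | G78 | G88 | G247 | G250 | G280 | G305 | | 69 / 8 | 0 / 0 | 63 / 63 | 4 / 1 | 0.943 |
| 5 | G73 | G78 | G88 | G116 | G250 | G280 | | | 69 / 8 | 0 / 0 | 63 / 63 | 4 / 1 | 0.943 |
| 6 | G31 | G73 | G78 | G88 | G250 | G280 | G305 | | 68 / 7 | 0 / 0 | 63 / 63 | 5 / 2 | 0.929 |
| 7 | G73 | G78 | G88 | G116 | G247 | G250 | G280 | | 68 / 8 | 0 / 0 | 63 / 63 | 5 / 1 | 0.929 |
| 8 | G73 | G78 | G88 | G247 | G250 | G280 | G322 | | 68 / 8 | 0 / 0 | 63 / 63 | 5 / 1 | 0.929 |
| 9 | G19 | G73 | G78 | G88 | G250 | G280 | | | 68 / 8 | 0 / 0 | 63 / 63 | 5 / 1 | 0.929 |
| 10 | G22 | G73 | G78 | G88 | G250 | G280 | | | 68 / 7 | 0 / 0 | 63 / 63 | 5 / 2 | 0.929 |
| 11 | G73 | G78 | G88 | G250 | G262 | G280 | | | 68 / 8 | 0 / 0 | 63 / 63 | 5 / 1 | 0.929 |
| 12 | G73 | G78 | G88 | G250 | G276 | G280 | | | 68 / 8 | 0 / 0 | 63 / 63 | 5 / 1 | 0.929 |
| 13 | G73 | G78 | G88 | G250 | G280 | G319 | | | 68 / 7 | 0 / 0 | 63 / 63 | 5 / 2 | 0.929 |
| 14 | G31 | G73 | G78 | G88 | G247 | G250 | G280 | G305 | 67 / 7 | 0 / 0 | 63 / 63 | 6 / 2 | 0.915 |
| 15 | G31 | G73 | G78 | G88 | G116 | G250 | G280 | | 67 / 7 | 0 / 0 | 63 / 63 | 6 / 2 | 0.915 |
| 16 | G31 | G73 | G78 | G88 | G250 | G280 | G322 | | 67 / 7 | 0 / 0 | 63 / 63 | 6 / 2 | 0.915 |
| 17 | G19 | G73 | G78 | G88 | G247 | G250 | G280 | | 67 / 8 | 0 / 0 | 63 / 63 | 6 / 1 | 0.915 |
| 18 | G22 | G73 | G78 | G88 | G247 | G250 | G280 | | 67 / 7 | 0 / 0 | 63 / 63 | 6 / 2 | 0.915 |
| 19 | G22 | G31 | G73 | G78 | G88 | G250 | G280 | | 67 / 6 | 0 / 0 | 63 / 63 | 6 / 3 | 0.915 |
| 20 | G73 | G78 | G88 | G247 | G250 | G254 | G280 | | 67 / 7 | 0 / 0 | 63 / 63 | 6 / 2 | 0.915 |
| 21 | G73 | G78 | G88 | G247 | G250 | G262 | G280 | | 67 / 8 | 0 / 0 | 63 / 63 | 6 / 1 | 0.915 |
| 22 | G73 | G78 | G88 | G247 | G250 | G276 | G280 | | 67 / 8 | 0 / 0 | 63 / 63 | 6 / 1 | 0.915 |
| 23 | G31 | G73 | G78 | G88 | G250 | G276 | G280 | | 67 / 7 | 0 / 0 | 63 / 63 | 6 / 2 | 0.915 |
| 24 | G73 | G78 | G88 | G247 | G250 | G280 | G319 | | 67 / 7 | 0 / 0 | 63 / 63 | 6 / 2 | 0.915 |
| 25 | G57 | G73 | G78 | G88 | G250 | G280 | | | 67 / 7 | 0 / 0 | 63 / 63 | 6 / 2 | 0.915 |
| 26 | G73 | G78 | G88 | G162 | G250 | G280 | | | 67 / 7 | 0 / 0 | 63 / 63 | 6 / 2 | 0.915 |
| 27 | G73 | G78 | G88 | G250 | G280 | G321 | | | 67 / 8 | 0 / 0 | 63 / 63 | 6 / 1 | 0.915 |
| 28 | G31 | G73 | G78 | G88 | G116 | G247 | G250 | G280 | 66 / 7 | 0 / 0 | 63 / 63 | 7 / 2 | 0.902 |
| 29 | G73 | G78 | G88 | G116 | G250 | G280 | G305 | | 66 / 8 | 0 / 0 | 63 / 63 | 7 / 1 | 0.902 |
| 30 | G31 | G73 | G78 | G88 | G247 | G250 | G280 | G322 | 66 / 7 | 0 / 0 | 63 / 63 | 7 / 2 | 0.902 |
| 31 | G73 | G78 | G88 | G116 | G250 | G280 | G322 | | 66 / 8 | 0 / 0 | 63 / 63 | 7 / 1 | 0.902 |
| 32 | G19 | G31 | G73 | G78 | G88 | G250 | G280 | | 66 / 7 | 0 / 0 | 63 / 63 | 7 / 2 | 0.902 |
| 33 | G22 | G31 | G73 | G78 | G88 | G247 | G250 | G280 | 66 / 6 | 0 / 0 | 63 / 63 | 7 / 3 | 0.902 |
| 34 | G22 | G73 | G78 | G88 | G250 | G280 | G305 | | 66 / 7 | 0 / 0 | 63 / 63 | 7 / 2 | 0.902 |
| 35 | G31 | G73 | G78 | G88 | G250 | G254 | G280 | | 66 / 6 | 0 / 0 | 63 / 63 | 7 / 3 | 0.902 |
| 36 | G31 | G73 | G78 | G88 | G250 | G262 | G280 | | 66 / 7 | 0 / 0 | 63 / 63 | 7 / 2 | 0.902 |
| 37 | G31 | G73 | G78 | G88 | G247 | G250 | G276 | G280 | 66 / 7 | 0 / 0 | 63 / 63 | 7 / 2 | 0.902 |
| 38 | G73 | G78 | G88 | G250 | G276 | G280 | G305 | | 66 / 8 | 0 / 0 | 63 / 63 | 7 / 1 | 0.902 |
| 39 | G31 | G73 | G78 | G88 | G250 | G280 | G319 | | 66 / 6 | 0 / 0 | 63 / 63 | 7 / 3 | 0.902 |
| 40 | G73 | G78 | G88 | G250 | G280 | G305 | G319 | | 66 / 7 | 0 / 0 | 63 / 63 | 7 / 2 | 0.902 |
| 41 | G57 | G73 | G78 | G88 | G247 | G250 | G280 | | 66 / 7 | 0 / 0 | 63 / 63 | 7 / 2 | 0.902 |
| 42 | G73 | G78 | G88 | G162 | G247 | G250 | G280 | | 66 / 7 | 0 / 0 | 63 / 63 | 7 / 2 | 0.902 |
| 43 | G73 | G78 | G88 | G171 | G247 | G250 | G280 | | 66 / 7 | 0 / 0 | 63 / 63 | 7 / 2 | 0.902 |
| 44 | G73 | G78 | G88 | G162 | G171 | G250 | G280 | | 66 / 7 | 0 / 0 | 63 / 63 | 7 / 2 | 0.902 |
| 45 | G73 | G78 | G88 | G247 | G250 | G280 | G321 | | 66 / 8 | 0 / 0 | 63 / 63 | 7 / 1 | 0.902 |

a MET1 to MET8 list the metabolites of each JEP; the corresponding intervals can be retrieved from Table 1. b MCC values calculated on the training set.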
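To make the use of such signatures concrete, the sketch below checks whether a subject's concentrations fall inside every interval of a pattern. It is only an illustration under stated assumptions: the interval notation of Table 1 is read as half-open when it ends with `[` and closed when it ends with `]`, the lower bound is taken as inclusive, the three-item pattern and the subject values are hypothetical, and the rule "predict HCC if any JEP matches" is an assumed decision rule rather than the study's documented procedure.

```python
import re
from typing import Dict, Iterable, Tuple

# (low, high, high_inclusive)
Interval = Tuple[float, float, bool]

_INTERVAL_RE = re.compile(r"^\[([^:\[\]]+):([^:\[\]]+)([\[\]])$")


def parse_interval(text: str) -> Interval:
    """Parse '[a:b[' (upper bound excluded) or '[a:b]' (upper bound included)."""
    match = _INTERVAL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized interval: {text!r}")
    low, high, closing = match.groups()
    return float(low), float(high), closing == "]"


def in_interval(value: float, interval: Interval) -> bool:
    low, high, high_inclusive = interval
    if high_inclusive:
        return low <= value <= high
    return low <= value < high


def matches(pattern: Dict[str, Interval], subject: Dict[str, float]) -> bool:
    """True if the subject has a value inside every interval of the pattern."""
    return all(
        name in subject and in_interval(subject[name], interval)
        for name, interval in pattern.items()
    )


def predict(subject: Dict[str, float],
            patterns: Iterable[Dict[str, Interval]]) -> str:
    """Assumed rule: a subject matching at least one JEP is predicted as HCC."""
    return "HCC" if any(matches(p, subject) for p in patterns) else "healthy"


# Hypothetical pattern built from three Table 1 intervals.
PATTERN = {
    "G73": parse_interval("[5:8["),
    "G78": parse_interval("[12:16["),
    "G88": parse_interval("[12:15["),
}

# Hypothetical subject values, for illustration only.
print(predict({"G73": 6, "G78": 12, "G88": 14.9}, [PATTERN]))
print(predict({"G73": 6, "G78": 12, "G88": 15}, [PATTERN]))
```

Because a JEP has zero support in the control class of the training set, this kind of rule yields no false positives on the training data by construction; the test-set behaviour still has to be measured, as in the classification columns of the table.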


Table 4. Confusion matrix on the external test set for state-of-the-art classifiers.

| Method a | TP | FP | TN | FN | Accuracy |
|---|---|---|---|---|---|
| SVM | 8 | 0 | 9 | 0 | 1 |
| ANN | 8 | 0 | 8 | 1 | 0.941 |
| RF | 8 | 0 | 8 | 1 | 0.941 |

a SVM: support vector machine; ANN: artificial neural network; RF: random forest.


Table 5. Examples of emerging patterns (EPs) related to metabolic pathways.

| Metabolic pathway | EP a | Support (total) | Support (HCC) | Support (control) | Growth-rate |
|---|---|---|---|---|---|
| Fatty acids metabolism | G23 G40 | 67 | 54 | 13 | 3.58 |
| | G23 G71 | 52 | 47 | 5 | 8.11 |
| | G40 G71 | 48 | 44 | 4 | 9.49 |
| | G40 G88 | 63 | 60 | 3 | 17.30 |
| | G71 G88 | 54 | 51 | 3 | 14.70 |
| | G23 G40 G46 | 53 | 46 | 7 | 5.67 |
| | G23 G46 G71 | 43 | 39 | 4 | 8.41 |
| | G46 G71 G88 | 43 | 41 | 2 | 17.70 |
| Urea cycle | G25 G28 G78 | 74 | 57 | 17 | 2.89 |

a The corresponding intervals can be retrieved from Table 1.
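The growth-rate column of Table 5 is consistent with the usual definition of an emerging pattern's growth rate: the relative support in the HCC class divided by the relative support in the control class. The sketch below assumes class sizes of 73 HCC patients and 63 healthy subjects (inferred from the Table 2 training counts, not stated alongside Table 5); the function name is illustrative.

```python
import math

# Assumed training-set class sizes, inferred from the Table 2 training counts.
N_HCC = 73
N_CONTROL = 63


def growth_rate(hcc_support, control_support,
                n_hcc=N_HCC, n_control=N_CONTROL):
    """Ratio of relative supports; infinite when the pattern never occurs
    in the control class, which is the jumping emerging pattern case."""
    hcc_freq = hcc_support / n_hcc
    control_freq = control_support / n_control
    if control_freq == 0:
        return math.inf if hcc_freq > 0 else 0.0
    return hcc_freq / control_freq


# Support counts (HCC, control) transcribed from Table 5.
for hcc, control in [(54, 13), (47, 5), (44, 4), (57, 17)]:
    print(hcc, control, round(growth_rate(hcc, control), 2))
```

Under these assumptions the four rows give about 3.58, 8.11, 9.49, and 2.89, matching the tabulated growth rates; the larger values in the table appear to be rounded to three significant figures.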


FIGURE LEGENDS

Figure 1. Workflow overview of the study including supervised discretization, metabolite selection, emerging patterns mining, and diagnostic tool implementation.

Figure 2. Supervised discretization protocol applied to the concentration range of a hypothetical metabolite (D: diseased patient; H: healthy subject; P: partition; S: score).

Figure 3. Hypothetical dataset containing an emerging pattern (the legend defines symbols for the class of HCC patients, the class of healthy subjects, the subjects, the metabolites, and the intervals).

Figure 4. ROC curve (in blue) representing the performance evaluation of the JEPs as metabolic signatures for HCC diagnosis.


FIGURES

Figure 1.


Figure 2.


Figure 3.


Figure 4.


REFERENCES

(1) Nordström, A.; Lewensohn, R. Metabolomics: Moving to the Clinic. J. Neuroimmune Pharmacol. 2010, 5 (1), 4–17.
(2) Patti, G. J.; Yanes, O.; Siuzdak, G. Metabolomics: The Apogee of the Omics Trilogy. Nat. Rev. Mol. Cell Biol. 2012, 13 (4), 263–269.
(3) Biomarkers Definitions Working Group. Biomarkers and Surrogate Endpoints: Preferred Definitions and Conceptual Framework. Clin. Pharmacol. Ther. 2001, 69 (3), 89–95.
(4) Monteiro, M. S.; Carvalho, M.; Bastos, M. L.; Guedes de Pinho, P. Metabolomics Analysis for Biomarker Discovery: Advances and Challenges. Curr. Med. Chem. 2013, 20 (2), 257–271.
(5) Madsen, R.; Lundstedt, T.; Trygg, J. Chemometrics in Metabolomics - a Review in Human Disease Diagnosis. Anal. Chim. Acta 2010, 659 (1–2), 23–33.
(6) Xia, J.; Broadhurst, D. I.; Wilson, M.; Wishart, D. S. Translational Biomarker Discovery in Clinical Metabolomics: An Introductory Tutorial. Metabolomics 2013, 9 (2), 280–299.
(7) Trygg, J.; Holmes, E.; Lundstedt, T. Chemometrics in Metabonomics. J. Proteome Res. 2007, 6 (2), 469–479.
(8) Worley, B.; Powers, R. Multivariate Analysis in Metabolomics. Curr. Metabolomics 2013, 1 (1), 92–107.
(9) Bartel, J.; Krumsiek, J.; Theis, F. J. Statistical Methods for the Analysis of High-Throughput Metabolomics Data. Comput. Struct. Biotechnol. J. 2013, 4, e201301009.
(10) Atzmueller, M. Subgroup Discovery. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2015, 5 (1), 35–49.
(11) Dong, G.; Li, J. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA; 1999; pp 43–52.
(12) Dougherty, J.; Kohavi, R.; Sahami, M. Supervised and Unsupervised Discretization of Continuous Features. In Machine Learning: Proceedings of the Twelfth International Conference; Morgan Kaufmann, 1995; pp 194–202.
(13) Chen, T.; Xie, G.; Wang, X.; Fan, J.; Qiu, Y.; Zheng, X.; Qi, X.; Cao, Y.; Su, M.; Wang, X.; et al. Serum and Urine Metabolite Profiling Reveals Potential Biomarkers of Human Hepatocellular Carcinoma. Mol. Cell. Proteomics MCP 2011, 10 (7), M110.004945.
(14) Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2; IJCAI’95; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1995; pp 1137–1143.
(15) Bakar, A. A.; Othman, Z. A.; Shuib, N. L. M. Building a New Taxonomy for Data Discretization Techniques. In 2nd Conference on Data Mining and Optimization; 2009; pp 132–140.
(16) Fisher, R. A. On the Interpretation of χ2 from Contingency Tables, and the Calculation of P. J. R. Stat. Soc. 1922, 85 (1), 87–94.
(17) Pearson, K. Note on Regression and Inheritance in the Case of Two Parents. Proc. R. Soc. Lond. 1895, 58 (347–352), 240–242.
(18) Matthews, B. W. Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. Biochim. Biophys. Acta 1975, 405 (2), 442–451.


(19) Powers, D. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2 (1), 37–63.
(20) Maindonald, J. H. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Int. Stat. Rev. 2012, 80 (1), 199–200.
(21) Herrera, F.; Luengo, J.; Sáez, J. A.; López, V.; Garcia, S. A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Trans. Knowl. Data Eng. 2013, 25 (4), 734–750.
(22) Ramírez-Gallego, S.; García, S.; Mouriño-Talín, H.; Martínez-Rego, D.; Bolón-Canedo, V.; Alonso-Betanzos, A.; Benítez, J. M.; Herrera, F. Data Discretization: Taxonomy and Big Data Challenge. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2016, 6 (1), 5–21.
(23) Liu, H.; Hussain, F.; Tan, C. L.; Dash, M. Discretization: An Enabling Technique. Data Min. Knowl. Discov. 2002, 6 (4), 393–423.
(24) Yang, Y.; Webb, G. I.; Wu, X. Discretization Methods. In Data Mining and Knowledge Discovery Handbook, 2nd ed.; 2010; pp 101–116.
(25) Kotsiantis, S.; Kanellopoulos, D. Discretization Techniques: A Recent Survey. 2006, 32 (1), 47–58.
(26) Ockner, R. K.; Kaikaus, R. M.; Bass, N. M. Fatty-Acid Metabolism and the Pathogenesis of Hepatocellular Carcinoma: Review and Hypothesis. Hepatol. Baltim. Md 1993, 18 (3), 669–676.
(27) Abel, S.; Smuts, C. M.; de Villiers, C.; Gelderblom, W. C. Changes in Essential Fatty Acid Patterns Associated with Normal Liver Regeneration and the Progression of Hepatocyte Nodules in Rat Hepatocarcinogenesis. Carcinogenesis 2001, 22 (5), 795–804.
(28) Kimhofer, T.; Fye, H.; Taylor-Robinson, S.; Thursz, M.; Holmes, E. Proteomic and Metabonomic Biomarkers for Hepatocellular Carcinoma: A Comprehensive Review. Br. J. Cancer 2015, 112 (7), 1141–1156.
(29) Chaerkady, R.; Harsha, H. C.; Nalli, A.; Gucek, M.; Vivekanandan, P.; Akhtar, J.; Cole, R. N.; Simmers, J.; Schulick, R. D.; Singh, S.; et al. A Quantitative Proteomic Approach for Identification of Potential Biomarkers in Hepatocellular Carcinoma. J. Proteome Res. 2008, 7 (10), 4289–4298.
(30) Warburg, O. On the Origin of Cancer Cells. Science 1956, 123 (3191), 309–314.
(31) Cocchetto, D. M.; Tschanz, C.; Bjornsson, T. D. Decreased Rate of Creatinine Production in Patients with Hepatic Disease: Implications for Estimation of Creatinine Clearance. Ther. Drug Monit. 1983, 5 (2), 161–168.
(32) Watanabe, A.; Higashi, T.; Sakata, T.; Nagashima, H. Serum Amino Acid Levels in Patients with Hepatocellular Carcinoma. Cancer 1984, 54 (9), 1875–1882.
(33) Soreide, K. Receiver-Operating Characteristic Curve Analysis in Diagnostic, Prognostic and Predictive Biomarker Research. J. Clin. Pathol. 2009, 62 (1), 1–5.
(34) Jin, G.; Zhou, X.; Wang, H.; Zhao, H.; Cui, K.; Zhang, X.-S.; Chen, L.; Hazen, S. L.; Li, K.; Wong, S. T. C. The Knowledge-Integrated Network Biomarkers Discovery for Major Adverse Cardiac Events. J. Proteome Res. 2008, 7 (9), 4013–4021.
(35) Ideker, T.; Sharan, R. Protein Networks in Disease. Genome Res. 2008, 18 (4), 644–652.
(36) Liu, R.; Wang, X.; Aihara, K.; Chen, L. Early Diagnosis of Complex Diseases by Molecular Biomarkers, Network Biomarkers, and Dynamical Network Biomarkers. Med. Res. Rev. 2014, 34 (3), 455–478.


FOR TOC ONLY
