New Method for Joint Network Analysis Reveals ... - ACS Publications

Jan 6, 2016 - arg max. ( ) j g. G gj gj g. 1. Because JRF utilizes the same splitting variable for ... certain prespecified threshold or the maximum n...
0 downloads 0 Views 3MB Size
Subscriber access provided by McMaster University Library

Article

A new method for joint network analysis reveals common and different co-expression patterns among genes and proteins in breast cancer FRANCESCA PETRALIA, Won-Min Song, Zhidong Tu, and Pei Wang J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00925 • Publication Date (Web): 06 Jan 2016 Downloaded from http://pubs.acs.org on January 12, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

A new method for joint network analysis reveals common and different co-expression patterns among genes and proteins in breast cancer Francesca Petralia, Won-Min Song, Zhidong Tu,∗ and Pei Wang∗ Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai E-mail: [email protected]; [email protected]

Phone: +1 (212) 659-8508; +1 (212) 731-7052.

Abstract In this paper, we focus on characterizing common and different co-expression patterns among RNAs and proteins in breast cancer tumors. To address this problem, we introduce Joint Random Forest (JRF), a novel non-parametric algorithm to simultaneously estimate multiple co-expression networks by effectively borrowing information across protein and gene expression data. The performance of JRF was evaluated through extensive simulation studies using different network topologies and data distribution functions. Advantages of JRF over other algorithms which estimate classspecific networks separately were observed across all simulation settings. JRF also outperformed a competing method based on Gaussian graphic models. We then applied JRF to simultaneously construct gene and protein co-expression networks based on protein and RNAseq data from CPTAC-TCGA breast cancer study. We identified ∗

To whom correspondence should be addressed

1

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

interesting common and differential co-expression patterns among gene and proteins. This information can help to cast light on the potential disease mechanisms of breast cancer.

Introduction In a recent breast cancer study conducted by the Clinical Proteomic Tumor Analysis Consortium (CPTAC), 1,2 extensive global- and phospho- protein profiling were obtained for a subset of breast cancer samples that have been extensively characterized in the The Cancer Genome Atlas (TCGA). 3 This is so far the first attempt to study protein activities in breast cancer samples using sophisticated protein experiments on a large scale. Then, by leveraging genomic analytical outputs from TCGA on these samples, 3 we have the unique opportunity to characterize and compare gene co-expression networks and protein co-expression networks based on the same set of samples. This is of great interest because knowledge about the common and different co-expression patterns among RNAs and proteins could improve our understanding of complicated gene regulatory mechanisms in tumor samples, and could also facilitate the detection of important disease genes and therapeutical targets. In the last decade, numerous statistical and machine learning methods have been proposed to construct gene-gene regulatory networks and protein-protein regulatory networks, including co-expression network methods, 4 Bayesian network methods, 5,6 and Gaussian graphical models. 7–10 Review of these and other methods are available in Hecker et al. 11 and Lee et al. 12 However, all these proposed models are designed to construct one network at a time. Presumably, we can use RNA expression data and protein expression data to construct two networks separately and then compare. But such an approach is certainly less optimal, as gene expressions and protein expressions in one tumor sample are closely related. On one hand, protein levels are regulated by their RNA levels, so we expect the two co-expression networks share common structures. On the other hand, protein activities are subject to a large amount of post-transcriptional modifications, thus we also expect to observe unique 2

ACS Paragon Plus Environment

Page 2 of 33

Page 3 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

interaction patterns in each network. This motivates us to perform joint learning of both RNA-seq and proteomic networks, so we are able to borrow information across different data sets and better capture the common correlation structures, which shall then improve the overall accuracy of network estimation. Various methods have been proposed to jointly estimate different networks. 13–17 Some of them 13,14,17 have been specifically designed to estimate time-varying graphical models in the context of time-series data. Guo et al. 16 and Danaher et al. 15 proposed likelihoodbased methods for the joint estimation of multiple related Gaussian graphical models. Their approaches rely on two different regularization schemes penalizing differences across partial correlation structures of different classes. While both papers successfully demonstrate the advantage of jointly modeling multiple networks over estimating individual networks separately, the performance of the methods heavily depends on the Multivariate Gaussian assumptions of gene expression distributions, which may not hold in real biological systems. Recently, random forest based methods have been utilized for the construction of GRN. 18–20 Random forest 21 is a decision tree based non-parametric method for building powerful prediction models. It has been extensively utilized for classification and regression problems. 22 Its ensemble structure combined with its non-parametric nature delivers excellent performance even when the sample size is moderate. Recently, Huynh-Thu et al. 18 proposed GENIE3, a random forest based method for estimating networks. The key idea is to model the expression of each target gene as a function of the expression of all other genes via random forest. The superior performance of GENIE3 was demonstrated in both DREAM 4 in-silico multifactorial challenge 23 and DREAM 5 network inference challenge. 24 In both challenges, GENIE3 and its extensions scored the best. In Maduranga et al., 20 GENIE3 was further modified to estimate GRN based on time-series data. Recently, Petralia et al. 19 proposed iRafNet, another random forest based method which combines different types of heterogeneous data, such as protein-protein interactions, transcription factor-DNA binding, gene knock-down to estimate GRN. While these random forest based methods greatly advance the field of GRN

3

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

construction, none of them is designed to estimate simultaneously multiple related networks. In this paper, we propose JRF (Joint Random Forest), a new algorithm for the simultaneous estimation of protein co-expression and gene co-expression networks. The key idea is to borrow information across different data by forcing the class-specific tree ensembles to use the same genes for the splitting rules. In this way, regulatory relationships playing important roles in multiple classes will be detected with better power. After testing the performance of our algorithm on several in-silico experiments and compared JRF with different methods including GENIE3 18 the Joint Graphical Lasso (JGL), 15 we applied JRF for the simultaneous construction of gene co-expression networks and protein co-expression networks. For our analysis, we considered gene-expression from TCGA and protein-expression from CPTAC data for 62 breast cancer patients. We derived the two networks and identified protein-specific hub genes and gene modules showing the unique contribution of proteomic data to breast cancer research.

Materials and Methods Random forest for network construction Random forest is a non-parametric algorithm which models a response variable via a collection of decision trees. Specifically, each decision tree is constructed based on a random subset of training samples. And the splitting variable at each node of a decision tree is chosen from a randomly sampled subset of predictors upon maximizing a certain utility function (e.g., decrease in node impurity) . Recently, Huynh-Thu et al. (2009) introduced GENIE3, a random forest based model for inferring gene regulatory networks (GRNs). In GENIE3, first, for each target gene k, its expression is modeled as a function of the expression of all other genes via random forest. Then regulatory event (j → k) is measured by Ij→k , the importance score of gene j in the random forest model of gene k. Specifically, the importance score Ij→k is defined as the 4

ACS Paragon Plus Environment

Page 4 of 33

Page 5 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

summation of node impurities across all nodes that utilize predictor j for the splitting rule divided by the total number of trees in the random forest model of gene k.

Joint Random Forest We start with some notations. Denote G as the total number of different classes. For each class g ∈ {1, ..., G}, denote the gene expression data for p genes and ng individuals as Xgng ×p , and the i-th observation of gene j under class g as xgij . An overview of JRF is shown in Figure 1 where, for simplicity, we consider only two classes. The key step of JRF is to construct G random forest models simultaneously using data from G classes, when predicting the expression of a target gene k based on the expression of all other genes. We propose to use the same predictor variables for splitting rules in different trees corresponding to different classes. The goal is to borrow information across different classes, so that regulatory relationships can be better detected if there are coherent signals across different classes. Specifically, when we grow G decision trees in parallel for G classes, at one node τ , we decide the splitting variable based on the following procedure: (1) Randomly sample a set Sτ containing N genes from the entire set of genes except gene k. (2) For each class, the best splitting rule based on each candidate predictor in set Sτ is derived. Let Pgτ be the subset of samples allocated to node τ under class g. The splitting rule based on the jth predictor is of the form

Lτgj (a∗gj ) = {i ∈ Pgτ : xgij > a∗gj }, Rτgj (a∗gj ) = {i ∈ Pgτ : xgij ≤ a∗gj }, where Lτgj and Rτgj are the subsets of samples in class g allocated to the left and right children of node τ with a splitting rule based on the jth predictor. For each predictor 5

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

in set Sτ , the best threshold a∗gj (τ ) is derived as follows τ a∗gj (τ ) = arg max Cgj (ω), ω

where τ Cgj (ω) = v[Pgτ ] − v[Lτgj (ω)] − v[Rτgj (ω)],

and v(P) =

P

x∈P (x − x)

2

τ with x representing the mean of set P. In other words, Cgj

is the decrease in node impurity after splitting node τ in the gth tree according to gene j. (3) Finally, gene gτ∗ is chosen as the splitting variable of node τ for all classes, based on the following criteria: gτ∗ = arg maxj

PG

g=1

τ (a∗ ) Cgj gj ng

Since JRF utilizes the same splitting variable for corresponding nodes across all classes, predictors associated with gene k in multiple classes are more likely to be chosen for the splitting rules. For each step in the tree construction, (1-3) only apply to classes for which τ is not a final node. As in the original random forest model, 21 each tree grows until either the total number of observations allocated to the final leaves falls below a certain pre-specified threshold or the maximum number of possible nodes is reached. It is worth mentioning that, when the number of classes is one, JRF reduces to GENIE3, the original random forest model for network inference. It is worth noting that, in order to implement JRF, data sets need to be standardized to mean zero and unit variance. Importance scores depend on the scale of the data and, therefore, variables (genes) need to be standardized before the random forest models are fitted.

6

ACS Paragon Plus Environment

Page 6 of 33

Page 7 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Assessing JRF performance Based on the above procedure, JRF constructs G random forest models simultaneously for each target gene k and returns a ranking of gene-gene interactions based on importance scores. In order to assess the performance of JRF in predicting the true interactions, receiver operating characteristic curves (ROC) and precision-recall curves can be computed by setting different thresholds on importance scores. In this paper, JRF is compared to JGL, 15 GENIE3-Sep and GENIE3-Comb on several in-silico experiments. GENIE3-Sep is GENIE3 used to estimate networks based on data from G classes separately; while GENIE3-Comb is GENIE3 used to estimate a unique network based on the union of data from all classes. All random forest based algorithms (JRF, GENIE3-Comb and GENIE3-Sep) provide a ranking of regulatory relationships based on importance scores and ROC curves were computed by setting different thresholds on these scores. Instead, for JGL, ROC curves were constructed by considering different values for the two parameters controlling the level of sparsity (as shown by Figure S1 in the supplementary material). Another approach to evaluate the performance of JRF is choosing a proper cutoff value for g importance scores using permutation techniques. Let Ij→k be the importance score associated

to the regulatory event (j → k) in the gth class-specific tree ensemble. Specifically, this P g τ where T is the number of trees and Ngj importance score is defined as Ij→k = T1 τ ∈Ngj Cgj is the set of nodes which utilize gene j for the splitting rule in the tree ensemble used to predict gene k based on the gth class-specific data. Since in this paper we are interested in g estimating undirected networks, we derive importance score Ij−k for every edge (j −k) as the g g . And the final undirected networks can and Ij→k average between importance scores Ik→j

therefore be derived by applying a cutoff on the edge importance scores. To derive a proper cutoff value for importance scores, we utilize the following permutation based procedure: (a) For b ∈ {1, · · · , B}, with B being the number of permutations, (a.1) For any target gene k, we first permute its sample order within each class and fit 7

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 33

G random forest models via JRF to predict the expression of gene k based on the expression of all other genes in G classes respectively. (a.2) Repeat (a.1) for all genes and compute the final importance scores for each edge g(b)

in each class, which are denoted as {Ip×p }G g=1 . (b) For each threshold ι, we compute f (ι) =

1 B

PB P g(b) b=1 j6=k 1(Ij−k >ι) P g j6=k 1(Ij−k >ι)

·

where 1( ) is the indicator function, equal to one if event

· occurs and zero otherwise.

f (ι) can serve as an approximation of the false discovery rate (FDR). 25 In the following numerical studies, we use ι0 = min{ι : f (ι) ≤ 0.001} and declare an edge between j and k g in class g if Ij−k > ι0 .

Computational complexity   P The computational complexity of JRF is O pT N G log(n )n g g , where p is the number g=1 of genes, T is the number of trees in each forest, N is the number of variables sampled at each node, G is the number of classes, and ng is the number of samples in each class. This complexity is in the same order as the complexity of applying GENIE3 to estimate G networks for G classes separately. In the contrast, the time complexity of GGM based approaches, such as JGL, often has the order of p3 . 15 So when p is large, JRF and GENIE3 enjoys better computational efficiency, which is supported by our numerical investigations (see supplementary material). In practice, the number of trees (T ) and the number of potential regulators to be sampled at each node (N ) are user-specified parameters. T is usually chosen sufficiently large, since the tree ensemble provides more accurate results as T increases. In our numerical studies, we reported results for T = 1000. We also evaluated larger T values, such as 2000 and 5000, but √ observed similar results (data not shown). For N , a common choice is N = p − 1 where 8

ACS Paragon Plus Environment

Page 9 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

(p − 1) is the number of predictors for each target gene. In general, large value of N results in predictions with high-bias; while low value results in predictions with high-variance. We implemented an R package JRF using C routines wrapped with an R interface. The R package is publicly available through CRAN. Codes and a tutorial for JRF’s implementation can also be downloaded from http://research.mssm.edu/tulab/software/JRF.html. Similarly to other tree-based algorithms for network construction, JRF can be easily parallelized since the estimation process consists of p independent sub-problems. A comparison between JRF, JGL, GENIE3-Sep and GENIE3-Comb in terms of computational time is presented in Table S1.

Data Synthetic data For data generation, we first considered two networks containing 500 nodes connected by 498 and 249 edges respectively. In particular, Network 1 is a network with two disjoint components and Network 2 consists of only the first component of Network 1. In this example, we focused on two network topologies: power law 10,26 and star topology (Figure S2 in supplementary material). For each network topology, we simulated 20 replicates involving n = 50 samples from a Gaussian graphical model. Network structures and covariance matrices were simulated in the same way as Danaher et al. (2014) 15 (further details can be found in supplementary material). Besides this example, we also consider a series of other different simulation settings, including (1) two non-nested networks; (2) five non-nested networks; (3) two non-related networks; and (4) data from non-Gaussian distributions. Please see supplementary material sections 1.3-1.7 for more details. For each data set, variables were standardized to mean zero and unit variance.

9

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 33

Breast cancer data The expressions of 5, 864 proteins for 62 breast cancer tumors 27 (proteome-ratio-norm-noNAnormal.gct(04-Jun-2014 version)) was downloaded from the CPTAC Data Coordinating Center (http://proteomics.cancer.gov/programs/cptacnetwork) sponsored by the National Cancer Institute. For 3 samples having technical duplicates, we took the average abundance of each protein across the duplicates. We further standardized protein expression so that each sample had median 0 and median absolute deviation (MAD) 1. Finally, we excluded proteins with interquartile range less than the 70% quantile. The resulting data matrix consisted of 1, 759 proteins. Level-three RNAseq data of breast tumor samples was obtained from the TCGA web site (http://tcga-data.nci.nih.gov/tcga/). First, we log-transformed the data and then replaced all missing values with 0. Then, we standardized each sample to have median 0 and MAD 1. We focused on the 62 samples contained in the proteomic data and excluded genes with more than 10% missing values. Then, we selected the top 10% genes with largest interquartile range across samples. The resulting data matrix based on RNAseq data contained 1, 464 genes. As final subset, we considered 528 genes obtained by overlapping the set of 1, 464 selected genes and the set of 1, 759 selected proteins. For each data set, genes and proteins were standardized to mean zero and unit variance.

Results and discussion Synthetic data In this section, JRF was compared to JGL, 15 GENIE3-Sep and GENIE3-Comb on several √ in-silico experiments. In all implementations, we used T = 1, 000 and N = p − 1. We evaluated the performances of different algorithms based on different network structures, sample sizes and number of genes.

10

ACS Paragon Plus Environment

Page 11 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Network structure Figure 2 shows the average ROC curves of each method over 20 replicates for Network 1 and Network 2 separately. JRF outperforms all the other methods under all different settings, while GENIE3-Sep is the second best performer. Figure S3 shows the boxplot of the area under the precision-recall curve (AUPR) for JRF, GENIE3-Sep and GENIE3-Comb. As shown, JRF outperforms competitors also in terms of AUPR. The advantage of JRF over GENIE3-Sep is more evident for Network 2, which contains only shared edges. This is expected as JRF should be able to detect common edges with better power. To further assess the performance of the proposed model, we included more simulation scenarios in the supplementary material. In particular, in supplementary material section 1.3, we considered the case of non-nested networks sharing some structure while, in section 1.4, we considered the case where five different networks were estimated simultaneously. As shown by Figure S4 and Figure S5, JRF outperformed competitors under all these simulation scenarios. In addition, it would be interesting to evaluate the performance of JRF when class-specific networks do not share any structure. For this purpose, we simulated two different networks with no common edge and applied different methods on the corresponding data (supplementary material section 1.5). As shown in Figure S6, performance of JRF is very comparable to that of GENIE3-Sep on this simulation example. This suggests that JRF is sufficiently robust and works in the absence of common structure. Moreover, in order to assess the performance of JRF in the presence of non-linearities in the data, we further generated data examples using GeneNetWeaver, 28 an open-source software for the generation of in-silico data from a set of ordinary and stochastic differential equations (supplementary material section 1.6). As shown in the supplemental Figure S7 and S8, JRF again outperformed other algorithms in term of both area under the ROC curve (AUC) and area under precision-recall curve (AUPR). Finally, Figure S9 shows results for different algorithms in estimating networks with 1,000 nodes (further details can be found in supplementary material section 1.7) in which we can see that JRF outperformed competitors in terms of both 11

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

AUC and AUPR. It is worth mentioning that the increased dimensionality did not negatively affect results, suggesting that JRF can handle problem with high dimensionality. Effect of sample size In this section, we assessed the performance of JRF and other algorithms on data sets involving different number of samples. Figure 3 shows boxplots of AUC over 20 replicates resulting from JRF, GENIE3-Sep and GENIE3-Comb for different sample sizes, i.e., n = 50, 100, 200. The AUCs of JRF are significantly larger than that of GENIE3-Sep based on 20 replicates under all sample sizes. As shown in the supplemental Figure S10, JRF outperforms competitors also in terms of AUPR for different samples sizes. As shown, the smaller the sample size, the bigger the improvement of JRF over GENIE3-Sep is. This result is expected, since, as the sample size increases, GENIE3-Sep should have sufficient information to estimate the networks separately. Edge specificity In this section, we compared JRF and GENIE3-Sep regarding their abilities to estimate class-specific edges. We considered the 20 replicate datasets (n = 100), which was simulated based on the power law network topology in the previous section. For both JRF and GENIE3-Sep, edges were declared using the permutation procedure illustrated above with B = 500 and an FDR threshold of 0.001. Table 1(a) shows the performance of JRF and GENIE3-Sep in estimating Network 1 and Network 2. For each network, we reported the minimum and the maximum value of true positive rate (T P R), false positive rate (F P R) and false discovery rate (F DR) across 20 replicates. In particular, these quantities are defined as T P R = T P/(T P + F N ), F P R = F P/(F P + T N ) and F DR = F P/(F P + T P ) where T P is the number of true positives, F P is the number of false positives, T N is the number of true negatives and F N is the number of false negatives. For both networks, JRF achieved better power (T P R) than GENIE3-Sep, while the resulting F P Rs were quite com-

12

ACS Paragon Plus Environment

Page 12 of 33

Page 13 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Table 1: (a) T P R, F P R and F DR of Network 1 and Network 2. (b) T P R, F P R and F DR of differential edges (Diff.Net). For each quantity (T P R, F P R and F DR) we show the minimun value (Min) and the maximum value (Max) across 20 replicates. Note that the true discovery rate (T DR) is defined as T DR = (1 − F DR).

(a)

Model

Network

JRF

Net 1 Net 2 Net 1 Net 2 Diff.Net Diff.Net

GENIE3-Sep (b)

JRF GENIE3-Sep

TPR Min Max 0.64 0.69 0.68 0.78 0.62 0.66 0.57 0.67 0.56 0.65 0.59 0.67

FPR Min Max 4e-4 7e-4 2e-4 6e-4 4e-4 6e-4 2e-4 6e-4 1e-4 2e-4 3e-4 4e-4

F DR Min Max 0.13 0.20 0.15 0.26 0.12 0.20 0.11 0.27 0.09 0.18 0.19 0.29

parable. Table 1(b) shows a comparison of the two algorithms regarding the detection of class-specific (differential) edges. Again, we computed the rate of true differential edges and false differential edges and showed the minimum and the maximum value of these quantities over 20 replicates. While the T P Rs for detecting differential edges were similar for the two methods, the F P Rs of GENIE3-Sep were significantly larger than that of JRF. In addition, JRF leads to substantially lower F DRs for detecting differential edges than GENIE3-Sep. This again suggests the merit of joint learning in JRF.

Co-expression RNA and protein networks in breast cancer We applied JRF to construct co-expression networks based on the CPTAC global proteomics data set and TCGA RNAseq data of 62 breast cancer samples. Networks via JRF For simplicity, we will refer to the co-expression networks constructed by JRF from global proteomics data and RNAseq data as Protein-network and RNA-network. At FDR cut-off of 0.001, we detected 687 common edges shared by both networks, 502 edges unique to Protein-network and 382 edges unique to RNA-network. The topologies of the two networks are shown in Figure 4, while the full list of interactions can be found in additional file 1. In-

13

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

terestingly, a few network modules in Protein-network have a high percentage of edges unique to Protein-network. No such structures is observed in RNA-network. Figure 5 highlights genes whose degrees (i.e. total number of edges connecting to one gene) in Protein-network is much higher than that in RNA-network. Three out of the top five genes with more Proteinunique edges — C3, CFB and MPO, are related to immune response functions. It is well known that immune response plays a major role in the patient response to chemotherapy treatments. 29 Some of these genes have been shown to be predictive of survival outcomes of breast cancer patients. For example, C3 and CP are predictive of chemoresistance in breast cancer patients 30 and high-activity of MPO genotypes can enhance efficacy of chemotherapy for early-stage breast cancer. 31 Other remarkable proteins are MMP8 and MMP9. Genes belonging to the MMP family have been linked to cancer progression for their ability to degrade the extracellular matrix and may be implicated in the formation of metastasis. 32,33 We then focused on network modules unique to Protein-network. We derived network modules based on edge betweenness using function “edge.betweenness.community” available in the R package igraph. 34 In Figure 6, we show the Protein-network with gene modules enriched of at least one GO category. The full list of genes contained in each module can be found in additional file 1. Figure 6 shows, for each module, the most enriched GO category and the corresponding Benjamini adjusted p-value. 35,36 Two interesting proteinspecific modules (with more protein-specific edges than shared edges) are C1 and C2. C1 is enriched of “fibroblast growth factor receptor activity”, as shown in Figure 7, genes are more tightly correlated in the proteomic data compared to the RNAseq data. Various studies have investigated the role of the FGFR pathway as a predictive marker for breast cancer. 37,38 As shown in Figure 7, FGFR2 and FGFR3 show a correlation close to one. Recently, Cerliani et al. 39 reported a strong correlation between FGFR2 and FGFR3 based on protein expression mentioning that there is no previous evidence of correlation between these two proteins for breast cancer patients. In addition, the interaction between FGFR2 and FGFR3 is contained in the STRING database 40,41 (see Table S2). Another interesting Protein-specific module

14

ACS Paragon Plus Environment

Page 14 of 33

Page 15 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

is C2. This module is enriched of “extracellular region”. Similarly to module C1, genes in module C2 are higher correlated in the proteomic data compared to the RNAseq data (Figure 7). This cluster contains genes such as APOB, C3, ORM1 and CP that have been recently shown to be predictive of chemoresistance in breast cancer patients. 30 As shown by Figure 7, protein expression of these genes are highly correlated. Another important gene in this module with a high number of connecting edges is APOA1 (the fifth highest connected node according to Figure 5). This gene has been recently identified as a potential target for disease progression using whole-genome trancriptomic and whole proteomic. 42 As shown in Table S2, eleven of the relationships in Figure 7 are contained in the STRING database. These results suggest that proteomic data can provide complementary information to genomics data and help to reveal important biological mechanisms. Comparison with GENIE3-Sep Since we are interested in assessing the utility of a joint learning, JRF was compared to the standard random-forest algorithm which constructs the two networks separately. For both methodologies, we derived undirected networks using the same FDR cut-off (0.001). The Protein-network resulting from GENIE3-Sep can be found in Figure S11 in the supplementary material. As shown, 57% of the edges are shared across Protein- and RNA- networks under JRF, while only 26% are shared under GENIE3-Sep. This finding can be explained by the fact that a joint learning can facilitate the detection of common edges. (a) Validation using GO terms. In this section, we validate our findings using GO categories. In particular, we consider 596 GO terms containing less than 200 genes (the list of GO terms can be found in Table S3, S4, S5 and S6 in the supplementary material). Figure 8(a) shows the number of edges contained in at least one GO term for different FDR cut-off for both Protein-networks. Figure 8(b) shows the enrichment p-value for different FDR cut-off. The enrichment p-value was calculated as follows. First, we constructed a network based on GO terms where an edge (a − b) was drawn between every pair of genes 15

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(a, b) sharing at least one GO term. Then, a contingency table was constructed considering variables “network based on GO Terms” and “network based on proteomic data” with categories “number of edges contained" and “number of edges non-contained" and a fisher exact test was performed to derive enrichment p-values. Figure 8(a) and Figure 8(b) show that JRF results in a network more overlapping with GO terms than the standard random-forest algorithm GENIE3-Sep.

(b) Validation using Esr1 and Gata3 knockdown signatures. To further validate our findings, we overlapped both networks with Esr1 and Gata3 knock-down signatures. We identified siRNA knock-down signatures of key transcription factors (TFs) of breast carcinoma in MCF7 cells1 by accessing gene expression data available under GSE31912 in Gene Expression Omnibus 43 (further details about these signatures can be found in §2 of supplementary material). For each network, a neighborhood was defined by defining a threshold k for the number of edges connecting a particular gene to the target gene of interest (either Esr1 or Gata3). For each network in Figure 8, we considered different values of k and counted the number of knock-out signatures contained in the neighborhood. Figure 8(c) and Figure 8(d) shows the total number of signatures contained in k-size neighborhoods for Esr1 and Gata3, respectively. As shown, also in this case, JRF results in a more enriched network. Figure S12 contains the same comparison for RNA networks.

Discussion In this paper, we developed JRF, a random forest based algorithm for the simultaneous construction of gene co-expression and Protein co-expression networks. JRF is designed to borrow information across different expression data by selecting the same set of predictors (genes) as splitting variables in the random forest model corresponding to each data. In this way, JRF is able to detect common relationships (edges) with better power, and detect differential edges-specific to individual data with less false positives. 16

ACS Paragon Plus Environment

Page 16 of 33

Page 17 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Since in this study we were mainly interested in assessing the unique contribution of proteomic data to breast cancer research, we focused our attention on Protein-specific hub genes and modules. We identified two interesting Protein-specific modules containing potential targets for breast cancer treatment. In addition, we compared our algorithm to the original random-forest algorithm which constructs the two networks separately and showed that our algorithm leads to networks more overlapping with available knock-out signatures and existing GO terms. The current version of JRF requires the feature (protein) space to be the same across different classes. Extension of JRF for handling data involving different sets of variables remains as future work. This problem arises in the protein domain, since protein abundance can be measured for different phosphorylation-site which map to a unique protein. As future work, we will design a model able to jointly estimate networks from phosphorylation-site abundance and RNAseq data, simultaneously. Besides estimating gene-regulatory-network and protein-regulatory-network, as shown in the paper, JRF can be used in many other applications. For example, we can apply JRF to estimate GRNs for the three breast cancer sub-categories (luminal, basal and her2), GRN for different tissues and GRN for cancer and normal tissues. Another interesting direction would be assessing the association between miRNAs-genes and miRNAs-proteins simultaneously. miRNAs are molecules which control the growth and proliferation of a cell. It is well known that the downregulation of some miRNAs may play an important role in the progression of cancer. 44 Therefore, a better understanding of the interactions between miRNA, mRNA and protein expression is crucial to cast light on the potential disease mechanisms of cancer. In this context, JRF can be easily utilized to jointly detect which miRNAs regulate genes and proteins. Another advantage of JRF relies on its computational efficiency. JRF can be easily parallelized since the computation can be divided into p independent sub-problems. On the other hand, for many existing methods such as Bayesian networks and GGM-based

17

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 33

algorithms, it is difficult to parallelize their computation. For this reason, JRF could be more preferred when handling large data sets. The merit of JRF depends on the assumption that networks of different classes share some common structures. In order to assess the performance of JRF in the absence of common structure, we conducted numerical studies to investigate such cases, and concluded that JRF is sufficiently flexible to guarantee good performance even when non-related networks are considered. In JRF, importance scores derived from random forest models are used to rank edge strengths. We propose a permutation based procedure to derive proper cutoffs on importance scores for selecting confident edges. Specifically, we introduce f (ι) as an approximation of the false discovery rate (FDR) of the selected edge set based on cutoff ι. However, in the numerical studies, we observe that f (ι) tends to under estimate FDR. Thus, a conservative threshold on f (ι) is recommended in practice. New methods to obtain more accurate estimate of FDR for JRF warrants future research.

Acknowledgement This work was supported in part through the computational resources and staff expertise provided by the Department of Scientific Computing at the Icahn School of Medicine at Mount Sinai. F.P. and P.W. are partially supported by National Institutes of Health grants (SUBU24CA160034 and SUB-R01GM108711). Z.T. received financial support from Berg Pharma as a consultant and support from National Institutes of Health (U01AG046170,U01HG008451) and Leducq Foundation Transatlantic Networks of Excellence Program grant. P.W. is also supported by National Institutes of Health grants (R01GM082802, sub-P01CA53996). We also acknowledge the expert assistance of Steven A. Carl, Karl R. Clauser, Chenwei Lin, D. R. Mani, Philipp Mertins, Amanda Paulovich, Xianlong Wang and Ping Yan.

18

ACS Paragon Plus Environment

Page 19 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Supporting Information Available Supporting information contains a pdf file and an excel file. The pdf file includes more results from both synthetic data and breast cancer data and is divided into two sections: “Insilico experiments” and “Co-expression RNA and protein networks in breast cancer”. Section “In-silico experiments” contains nine subsection: “JGL parameter specification”, “Network topology”, “Estimation of non-nested networks”, “Estimation of five networks”, “Estimation of non-related networks’, “Gaussian model vs GeneNetWeaver”, “Network dimension”, “Effect of sample size”, “Computational time”. On the other hand, section “Co-expression RNA and protein networks in breast cancer” includes four sub-sections: “String database”, “Protein and RNAseq networks from GENIE3-Sep”, “GO categories” and “Knockout signatures”. The excel file includes the list of gene-gene interactions for both Protein and RNAseq data derived via JRF, the list of genes contained in each of the clusters shown in Figure 6 and the list of interactions obtained via GENIE3-Sep. This material is available free of charge via the Internet at http://pubs.acs.org/.

19

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 1: JRF schematic. For simplicity, let us assume that there are only two classes and that each data contains the same number of samples. For each target gene gk , we run random forest for each class and the two tree ensembles are forced to share some structure as follows. First, for each node τ , the same set of genes Sτ is randomly sampled from the entire set of genes (a). Then, the same variable in set Sτ is chosen for the splitting rule. This is achieved by deriving the splitting rule for any gene contained in set Sτ (b) and choosing the best variable to maximize the sum of decrease in node impurity over different classes (c).

Figure 2: Mean of ROC curves for Network 1 (first column) and Network 2 (second column) over 20 replicates for JRF (red), GENIE3-Comb (green), GENIE3-Sep (blue) and JGL (black). For each replicate, we sampled 50 samples from a Gaussian graphical model based on the power law topology (first row) and star topology (second row).

Figure 3: Boxplot of AUC over 20 replicates for JRF (red), GENIE3-Comb (green) and GENIE3-Sep (blue) for different sample sizes, i.e., n= 50, 100 and 200. Data: samples simulated from a Gaussian graphical model on a power law topology.

Figure 4: Network based on proteomic data (a) and RNAseq data (b) resulting from JRF. Green edges (687) are shared between proteomic and RNAseq data, red edges (502) are unique to proteomic data, while blue edges are unique to RNAseq data (382). Only names of genes with at least ten connecting edges are shown in the plot. For both networks, the full list of interactions can be found in additional file 1

Figure 5: Number of connecting edges for top connected genes in the network based on proteomic data. For each gene, the green bar corresponds to the number of connecting edges shared between proteomic and RNAseq data; the red bar indicates the number of Proteinspecific edges; while the blue bar indicates the number of RNA-specific edges. For each gene, we list its biological functions.

Figure 6: Network based on proteomic data resulting from JRF with corresponding gene modules. Modules were detected using function “edge.betweenness.community” available in the R package igraph. For each module, we show the most enriched GO category with corresponding Benjamini adjusted p-value (P). In particular, only modules reporting a significant enriched Benjamini adjusted p-value (P≤0.01) are highlighted.

20

ACS Paragon Plus Environment

Page 20 of 33

Page 21 of 33

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 22 of 33

Page 23 of 33

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 24 of 33

Page 25 of 33

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 26 of 33

Page 27 of 33

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(4) Zhang, B.; Horvath, S. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 2005, 4 . (5) Zhu, J.; Zhang, B.; Smith, E. N.; Drees, B.; Brem, R. B.; Kruglyak, L.; Bumgarner, R. E.; Schadt, E. E. Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat. Genet. 2008, 40, 854–861. (6) Friedman, N.; Linial, M.; Nachman, I.; Pe’er, D. Using Bayesian networks to analyze expression data. J. Comput. Biol. 2000, 7, 601–620. (7) Schäfer, J.; Strimmer, K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat. Appl. Genet. Mol. Biol. 2005, 4. (8) Yuan, M.; Lin, Y. Model selection and estimation in the Gaussian graphical model. Biometrika 2007, 94, 19–35. (9) Friedman, J.; Hastie, T.; Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008, 9, 432–441. (10) Peng, J.; Wang, P.; Zhou, N.; Zhu, J. Partial correlation estimation by joint sparse regression models. J. Am. Stat. Assoc. 2009, 104, 735–746. (11) Hecker, M.; Lambeck, S.; Toepfer, S.; Van Someren, E.; Guthke, R. Gene regulatory network inference: data integration in dynamic models—a review. Biosystems 2009, 96, 86–103. (12) Lee, W.-P.; Tzou, W.-S. Computational methods for discovering gene networks from expression data. Briefings Bioinf. 2009, 10, 408–423. (13) Zhou, S.; Lafferty, J.; Wasserman, L. Time varying undirected graphs. Machine Learning 2010, 80, 295–319.

28

ACS Paragon Plus Environment

Page 28 of 33

Page 29 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

(14) Song, L.; Kolar, M.; Xing, E. P. KELLER: estimating time-varying interactions between genes. Bioinformatics 2009, 25, i128–i136. (15) Danaher, P.; Wang, P.; Witten, D. M. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2014, 76, 373–397. (16) Guo, J.; Levina, E.; Michailidis, G.; Zhu, J. Joint estimation of multiple graphical models. Biometrika 2011, asq060. (17) Ahmed, A.; Xing, E. P. Recovering time-varying networks of dependencies in social and biological studies. Proc. Natl. Acad. Sci. U. S. A. 2009, 106, 11878–11883. (18) Huynh-Thu, V. A.; Irrthum, A.; Wehenkel, L.; Geurts, P. Inferring regulatory networks from expression data using tree-based methods. PloS one 2010, 5, e12776. (19) Petralia, F.; Wang, P.; Yang, J.; Tu, Z. Integrative random forest for gene regulatory network inference. Bioinformatics 2015, 31, i197–i205. (20) Maduranga, D.; Zheng, J.; Mundra, P. A.; Rajapakse, J. C. Pattern Recognition in Bioinformatics; Springer, 2013; Chapter 2, pp 13–22. (21) Breiman, L. Random forests. Machine learning 2001, 45, 5–32. (22) Yang, P.; Hwa Yang, Y.; B Zhou, B.; Y Zomaya, A. A review of ensemble methods in bioinformatics. Curr. Bioinf. 2010, 5, 296–308. (23) Greenfield, A.; Madar, A.; Ostrer, H.; Bonneau, R. DREAM4: Combining genetic and dynamic information to identify biological networks and dynamical models. PloS one 2010, 5, e13397–e13397. (24) Marbach, D.; Costello, J. C.; Küffner, R.; Vega, N. M.; Prill, R. J.; Camacho, D. M.; Allison, K. R.; Kellis, M.; Collins, J. J.; Stolovitzky, G. Wisdom of crowds for robust gene network inference. Nat. Methods 2012, 9, 796–804. 29

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(25) Tusher, V. G.; Tibshirani, R.; Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U. S. A. 2001, 98, 5116–5121. (26) Chen, H.; Sharp, B. M. Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinf. 2004, 5, 147. (27) Mertins, P. et al. Proteogenomic analysis of human breast cancer connects genetic alterations to phosphorylation networks. unpublished work, The Broad Institute of MIT and Harvard July, 2015, (28) Schaffter, T.; Marbach, D.; Floreano, D. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics 2011, 27, 2263–2270. (29) Andre, F.; Dieci, M. V.; Dubsky, P.; Sotiriou, C.; Curigliano, G.; Denkert, C.; Loi, S. Molecular pathways: involvement of immune pathways in the therapeutic response and outcome in breast cancer. Clin. Cancer Res. 2013, 19, 28–33. (30) Hyung, S.-W.; Lee, M. Y.; Yu, J.-H.; Shin, B.; Jung, H.-J.; Park, J.-M.; Han, W.; Lee, K.-M.; Moon, H.-G.; Zhang, H. A serum protein profile predictive of the resistance to neoadjuvant chemotherapy in advanced breast cancers. Mol. Cell. Proteomics 2011, 10, M111–011023. (31) Ambrosone, C. B.; Barlow, W. E.; Reynolds, W.; Livingston, R. B.; Yeh, I.-T.; Choi, J.Y.; Davis, W.; Rae, J. M.; Tang, L.; Hutchins, L. R. Myeloperoxidase genotypes and enhanced efficacy of chemotherapy for early-stage breast cancer in SWOG-8897. J. Clin. Oncol. 2009, 27, 4973–4979. (32) Thirkettle, S.; Decock, J.; Arnold, H.; Pennington, C. J.; Jaworski, D. M.; Edwards, D. R. Matrix metalloproteinase 8 (collagenase 2) induces the expression of interleukins 6 and 8 in breast cancer cells. J. Biol. Chem. 2013, 288, 16282–16294.

30

ACS Paragon Plus Environment

Page 30 of 33

Page 31 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

(33) Zhao, S.; Han, J.; Zheng, L.; Yang, Z.; Zhao, L.; Lv, Y. MicroRNA-203 Regulates Growth and Metastasis of Breast Cancer. Cell. Physiol. Biochem. 2015, 37, 35–42. (34) Csardi, G.; Nepusz, T. The igraph software package for complex network research. InterJournal, Complex Systems 2006, 1695, 1–9. (35) Huang, D. W.; Sherman, B. T.; Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2008, 4, 44–57. (36) Huang, D. W.; Sherman, B. T.; Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37, 1–13. (37) André, F.; Cortés, J. Rationale for targeting fibroblast growth factor receptor signaling in breast cancer. Breast Cancer Res. Treat. 2015, 150, 1–8. (38) Sun, S.; Jiang, Y.; Zhang, G.; Song, H.; Zhang, X.; Zhang, Y.; Liang, X.; Sun, Q.; Pang, D. Increased expression of fibroblastic growth factor receptor 2 is correlated with poor prognosis in patients with breast cancer. J. Surg. Oncol. 2012, 105, 773–779. (39) Cerliani, J. P.; Vanzulli, S. I.; Piñero, C. P.; Bottino, M. C.; Sahores, A.; Nuñez, M.; Varchetta, R.; Martins, R.; Zeitlin, E.; Hewitt, S. M. Associated expressions of FGFR-2 and FGFR-3: from mouse mammary gland physiology to human breast cancer. Breast Cancer Res. Treat. 2012, 133, 997–1008. (40) Szklarczyk, D.; Franceschini, A.; Kuhn, M.; Simonovic, M.; Roth, A.; Minguez, P.; Doerks, T.; Stark, M.; Muller, J.; Bork, P. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011, 39, D561–D568. (41) Parker, B. C.; Engels, M.; Annala, M.; Zhang, W. Emergence of FGFR family gene

31

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

fusions as therapeutic targets in a wide spectrum of solid tumours. The Journal of pathology 2014, 232, 4–15. (42) Cine, N.; Baykal, A. T.; Sunnetci, D.; Canturk, Z.; Serhatli, M.; Savli, H. Identification of ApoA1, HPX and POTEE genes by omic analysis in breast cancer. Oncol. Rep. 2014, 32, 1078–1086. (43) Barrett, T.; Edgar, R. [19] Gene Expression Omnibus: Microarray Data Storage, Submission, Retrieval, and Analysis. Methods Enzymol. 2006, 411, 352–369. (44) Ryan, B. M.; Robles, A. I.; Harris, C. C. Genetic variation in microRNA networks: the implications for cancer research. Nature Reviews Cancer 2010, 10, 389–402.

32

ACS Paragon Plus Environment

Page 32 of 33

Page 33 of 33

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment