Article pubs.acs.org/jpr
Advancing Clinical Proteomics via Analysis Based on Biological Complexes: A Tale of Five Paradigms Wilson Wen Bin Goh*,†,‡ and Limsoon Wong*,‡,§ †
School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Nankai District, Tianjin 300072, China Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417 § Department of Pathology, National University of Singapore, 5 Lower Kent Ridge Road, Singapore 117417 ‡
S Supporting Information *
ABSTRACT: Despite advances in proteomic technologies, idiosyncratic data issues, for example, incomplete coverage and inconsistency, resulting in large data holes, persist. Moreover, because of naiv̈ e reliance on statistical testing and its accompanying p values, differential protein signatures identified from such proteomics data have little diagnostic power. Thus, deploying conventional analytics on proteomics data is insufficient for identifying novel drug targets or precise yet sensitive biomarkers. Complex-based analysis is a new analytical approach that has potential to resolve these issues but requires formalization. We categorize complex-based analysis into five method classes or paradigms and propose an even-handed yet comprehensive evaluation rubric based on both simulated and real data. The first four paradigms are well represented in the literature. The fifth and newest paradigm, the network-paired (NP) paradigm, represented by a method called Extremely Small SubNET (ESSNET), dominates in precision-recall and reproducibility, maintains strong performance in small sample sizes, and sensitively detects low-abundance complexes. In contrast, the commonly used over-representation analysis (ORA) and direct-group (DG) test paradigms maintain good overall precision but have severe reproducibility issues. The other two paradigms considered here are the hit-rate and rank-based network analysis paradigms; both of these have good precision-recall and reproducibility, but they do not consider low-abundance complexes. Therefore, given its strong performance, NP/ESSNET may prove to be a useful approach for improving the analytical resolution of proteomics data. Additionally, given its stability, it may also be a powerful new approach toward functional enrichment tests, much like its ORA and DG counterparts. KEYWORDS: proteomics, networks, bioinformatics, systems biology, translational biology
■
INTRODUCTION
or proteins) from high-throughput biological data sets remains a subject of immense challenge. Feature selection is a statistical procedure commonly used for simplifying models by identifying and including only the most relevant measured biological entities. Features are usually proteins or genes, depending on the nature of the screen (proteomics or genomics). Feature selection is reliant on the use of statistical test procedures, for example, the t test followed by a standard pvalue cutoff of 0.01 or 0.05. Unfortunately, it is becoming clear that standard statistical test procedures are quite unreliable. Part of the issue stems from over-reliance and naiv̈ e interpretation of the test procedure’s p value, which itself has been shown to be highly unstable.6 Consequently, the selected features in one screen are not reproducible in another screen on
Recent advances in hardware have made proteomics increasingly relevant toward clinical deployment. These advances include powerful protein extraction procedures, for example, PCT,1 brute-force spectra-capture methods, for example, SWATH,2 and improved multiplexing technologies.3 Although proteomics still lags behind genomics in terms of throughput, these technological increments are of critical importance: Assaying proteins provides direct information on the underlying molecular landscape (expression levels and post-translational modifications), a crucial limitation that next-generation sequencing cannot address. Naturally, these advances lead to improvements in proteomics data quality (better quantitation, larger sample sizes, more proteins identified, etc.)4 Unfortunately, idiosyncratic coverage and consistency issues persist.5 Moreover, selecting reproducible and relevant features (differential genes © XXXX American Chemical Society
Received: May 4, 2016
A
DOI: 10.1021/acs.jproteome.6b00402 J. Proteome Res. XXXX, XXX, XXX−XXX
Article
Journal of Proteome Research
Of the five paradigms, ORA and DG are the most traditional and well-known. ORA is sometimes known as overlap analysis and involves first the identification of a list of differential proteins, followed by a statistical enrichment test. The hypergeometric enrichment (HE) pipeline, which we also use here, is perhaps the most known instance of ORA. HE involves two steps: Differential-protein selection via the two-sample t test followed by a test for significance of overlap with some reference protein complex via the hypergeometric test. Its main shortcoming is that it is very sensitive to the statistical methods and thresholds used for generating the differential-protein set.32 DG avoids the shortcoming of ORA by testing whether a protein complex is differential as a whole. This is achieved by evaluating whether the distribution of expressions of all constituent proteins of that complex is different from proteins not in the complex. DG methods include functional class scoring (FCS)28,33 and gene set enrichment analysis (GSEA).9 There are several variants of both GSEA and FCS: The original GSEA works by first ranking all proteins based on t statistic. This is followed by the Kolmogorov−Smirnov (KS) test to check whether the ranks of proteins in the reference complex and the ranks of proteins outside the reference complex come from the same distribution. The significance of the KS test statistic is evaluated against a null distribution obtained by permuting patient phenotype labels. The current GSEA replaces the KS test with a weighted KS test, where lesser weights are given to the lower ranks. Similarly, the original FCS33 works by first computing the log-transformed t-statistic p value of all proteins. Then, it checks the average logtransformed t-statistic p value of proteins in the reference complex against a null distribution of average log-transformed tstatistic p values of random sets of proteins of the same cardinality as the reference complex.33 A refinement of FCS uses a null distribution formed by random permutation of patient phenotype labels.34 Here, as a classic representative of the DG paradigm, we use the original GSEA, based on the Kolmogorov−Smirnov (KS) test; however, we use the theoretical Kolmogorov distribution as the null distribution because of the small sample sizes in this study (and also because all other methods examined here use theoretical distributions instead of resampling statistics). HR is initially conceived to address inconsistency issues in data-dependent acquisition (DDA)-proteomics. Its defining characteristic involves calculation of overlaps (or hit-rates) among detected proteins against a vector of complexes to generate a hit-rate vector, which is used for class discrimination and feature selection. There are two variants of HR: Proteomics signature profiling (PSP)35 and its newer quantitative variant QPSP.8 On renal cancer (RC), QPSP provides high stability and prediction of relevant features over existing methods, including HE.8 It is also able to detect subpopulations (disease severity outcome) within the cancer group. Because QPSP is designed to work with newer proteomics data sets, we use QPSP. RBNA is a class of methods that incorporates rank-weights onto measured proteins based on their individual sample ranks and class-weights based on the proportion of supporting information among samples. It greatly improves signal-to-noise ratios in a series of noisy blood cancer genomics data sets.30 There are currently three members in the RBNA family: SubNET (SNET), fuzzy SNET (FSNET), and paired FSNET (PFSNET).29,30 PFSNET is the newest and most powerful member in this class of techniques, so we use PFSNET.48
the same biological phenotype. Venet et al. confirmed this in their studies of breast cancer signatures. They further went on to show that randomly generated gene signatures do as well as, if not better than, real ones inferred from previous data sets on predictive performance.7 Thus, the selected features, besides lacking reproducibility, might also be lacking relevance. Reproducibility and relevance problems may be resolved via the integration of contextual information with proteomics data.8 Contextual information refers to prior knowledge on biological function and includes gene groups,9 biological complexes,10−12 and biological networks.13,14 Gene groups are assemblies of genes, which are consistently associated with each other and may have common functionalities; existence of physical interactions among their protein products is nonessential. Biological complexes are known physical assemblies of proteins, which likely serve a common function; physical interactions are expected, but the exact binding configuration is not of penultimate importance. There are many types of biological networks, which include protein− protein interaction, regulatory, signaling, and metabolic.5,15,16 Full networks tend to be rather large and impractical for indepth analysis, so clustering algorithms are commonly used alongside these to identify densely connected components (clusters). Unfortunately, biological networks tend to be rather dispersed and incomplete.17 Moreover, there is a dizzying array of clustering algorithms that all give rise to different cluster sets.18 In any case, the validity or meaningfulness of the predicted network clusters is usually evaluated by comparison against real complexes, given the reasoning that a good predicted cluster set should resemble a set of real complexes to be enriched for biological signal.18 In regard to this, biological complexes may be thought of as a gold standard of networkpredicted clusters.19−21 Moreover, in our recent benchmarks, it turns out that biological complexes strongly outperform predicted clusters during practical application anyway.22,23 Recently, we have advocated the use of biological complexes for studying proteomics data sets for a number of reasons.24 First, information on biological complexes tends to be highly centralized, for example, CORUM10−12 for human complexes and MIPS25,26 for yeast. Second, biological complexes exhibit high biological signal enrichment.8,23 Third, they have the advantage of being an immutable reference; they do not need to be derived using clustering algorithms or reference networks, which yield different results depending on which ones are used. Lastly, standardizing complexes is much easier. For example, in terms of representation, a complex is concisely just a list of its constituent proteins (and standard naming convention for proteins exists); the exact topological configuration among constituent proteins is not of paramount importance.24 Unfortunately, a formalization of the classes of methods for complex-based analysis in proteomics has never been done before, and a suite of reasonable benchmarks does not exist. To fill the first of these two gaps, we categorize current complexbased analysis methods into five paradigms (or method classes): Four of these paradigms are known and have previously been applied on proteomics data. These four paradigms are over-representation analysis (ORA),27 directgroup (DG),28 hit-rate (HR),8 and rank-based network analysis (RBNA).29,30 We take the opportunity to introduce a new fifth paradigm, network-paired (NP),31 which has never been studied with proteomics data prior to this work. For succinctness, a representative method from each paradigm is tested. B
DOI: 10.1021/acs.jproteome.6b00402 J. Proteome Res. XXXX, XXX, XXX−XXX
Article
Journal of Proteome Research
excluded from this analysis. Proteins are quantified via spectral count, which is the total number of MS/MS spectra acquired for peptides from a given protein.
NP is the newcomer among these analytical paradigms. It is rigorously tested in genomics but never on proteomics data. Currently, there is only one instance in NP, extremely small SubNET (ESSNET), which works by testing the distribution of paired class differences on a given set of subnets or complexes.31 ESSNET is demonstrably powerful and, in comparative analysis across a variety of gene expression data sets, shown to outperform other techniques in terms of reproducibility, including the RBNAs31 (see Materials and Methods for details of each method). To fill the second gap, which concerns benchmarks, we design evaluation criteria on two levels: Precision and recall on simulated data (where true positives are known a priori but may not perfectly model real data) and reproducibility and stability on real data sets (true positives are not known a priori but are real biological data sets). We purport that a perfect network method should exhibit strong precision and recall (where we may satisfactorily infer these) and also strong reproducibility (on real data), especially on small data sets.36 The last criterion is particularly important in clinical deployment, given relative sample scarcity and proteomic platform scalability limits.
■
Real Biological Complexes
Although subnets or clusters can be predicted from large biological networks, our previous works demonstrated that real biological complexes are significantly enriched for biological signal, far outperforming predicted complexes/subnets from reference networks.23 Also, in other contexts, complexes have been shown to be the best predictors of phenotype given a multitude of predictors in yeast (expression correlation, genetic interactions, etc.).22 Hence, we use known human protein complexes derived from the CORUM database.10 Because some CORUM complexes are very small, these may give rise to high fluctuation in the test statistics used by the methods considered here. Thus, for inclusion in our feature vector, we use only protein complexes of at least size 5 (600 out of 1323 complexes meet this criterion). Hypergeometric Enrichment
Hypergeometric enrichment (HE) is the most well-known example of ORA. HE comprises two parts:5 First, a t-statistic (Tp) is calculated for each protein p by comparing the expression scores between phenotype classes A and B, with the assumption of unequal variance between the two classes,39 as given by
MATERIALS AND METHODS
Simulated Proteomics Data: D1.2 and D2.2
Two simulated data sets from the study of Langley and Mayr are used.37 D.1.2 is from LC−MS/MS study of proteomic changes resulting from the addition of exogenous matrix metallo-peptidase (three control, three test). D2.2 is from a study of hibernating arctic squirrels (four control, four test). Quantitation in both studies is based on spectral counts. Both D1.2 and D2.2 comprise 100 simulated data sets with 20% randomly generated significant features. For D1.2 and D2.2, this corresponds to 177 and 710 significant proteins, respectively. The effect sizes of these 20% differential features are sampled from five possibilities or p (20, 50, 80, 100, and 200%), increased in only one class and not in the other, and expressed as
Tp =
xA̅ − xB̅ sA2 nA
+
sB2 nB
where x#̅ is the mean abundance level of the protein p, s# is the standard deviation, and n# is the sample size, in class #. The degree-of-freedom (DOF) is determined using the Welch− Satterweite equation. The DOF is used for determining the shape of the nominal t distribution. Tp is compared against this nominal t-distribution to determine the corresponding p value. A protein is differential between classes A and B if p value ≤ alpha (where alpha = 0.05). Next, the set of differential proteins is compared against a vector of complexes. Given a total number of proteins N, with M of these belonging to a complex and n of these proteins in the differential set, the probability P that b or more proteins from the differential set are associated by chance with the complex is given by
SCi , j′ = SCi , j*(1 + p)
where SCi,j and SCi,j′ are, respectively, the original and simulated spectral count from the jth sample of protein i. Renal Cancer Control
The renal cancer control (RCC) data set, from the study of Guo et al.,1 comprises 12 SWATH runs originating from a human kidney test tissue digested in quadruplicate (×4) and each digest analyzed in triplicate (×3) using a tripleTOF 5600 mass spectrometer (AB Sciex). 2331 proteins are quantified across the 12 SWATH maps with a peptide and protein false discovery rate (FDR) of 1%. Because just one true sample class exists, RCC may be used for false-positive (FP) rate evaluation.
⎛
min(n , M )
P(X ≥ b) =
∑ i=b
⎞
(ni)⎜⎝ NM −− ni ⎟⎠ ⎛N⎞ ⎜ ⎟ ⎝M⎠
The complex is significant if P(X ≥ b) ≤ alpha (where alpha = 0.05).
Renal Cancer
Direct-Group Analysis Based on the Kolmogorov−Smirnov Test
The RC study of Guo et al.1 comprises 24 SWATH runs originating from six pairs of nontumorous and tumorous clearcell renal carcinoma (ccRCC) tissues, in duplicate. All SWATH maps are analyzed using OpenSWATH38 against a spectral library containing 49 959 reference spectra for 41 542 proteotypic peptides from 4624 reviewed SwissProt proteins.1 The library is compiled via library search of spectra captured in DDA mode (linking spectra mz and rt coordinates to a library peptide). Protein isoforms and protein groups are
Direct-group (DG) analysis methods determine if a complex is relevant by comparing the distribution of constituent protein expression between phenotype classes against that of proteins outside the complex. DG is typically used for evaluation of functional gene groups, a famous example being gene set enrichment analysis (GSEA), which uses a weighted Kolmogorov−Smirnov (KS) statistic.9 C
DOI: 10.1021/acs.jproteome.6b00402 J. Proteome Res. XXXX, XXX, XXX−XXX
Article
Journal of Proteome Research Here the two-sample KS test is used to evaluate if the distribution of ranks based on the t-statistic of proteins in a complex differs from that of proteins outside the complex. Denoting proteins in the complex as the set C and proteins outside the complex as the set C′, the KS-statistic KSC,C′ is expressed as
SHA , HB =
If the t-statistic is significant (i.e., its associated p value is lower than 0.05), then C is differentially expressed between classes A and B. Paired Fuzzy SubNET (PFSNET)
KSC , C ′ = max x|F1, C(x) − F2, C ′(x)|
In PFSNET, given a protein gi and a tissue pk, fs(gi,pk) is assigned a value between 1 and 0 as follows: fs(gi,pk) is assigned the value 1 if gi is among the top alpha1% (default =10%) of the most-abundant proteins in pk. It is assigned the value 0 if gi is not among the top alpha2% (default = 0%) most-abundant proteins in pk. The range between alpha1% and alpha2% is divided into n equal-sized bins (default n = 4), and fs(gi,pk) is assigned the value 0.8, 0.6, 0.4, or 0.2 depending on which bin gi falls into in pk. Given a protein gi and a class of tissues A, let
where F1,C(x) and F2,C′(x) are, respectively, the fraction of proteins in C and C′, whose rank is higher than the rank x. The null hypothesis is rejected at an alpha of 0.05 if KSC , C ′ ≥ c(alpha)*
|C| + |C′| |C|*|C′|
where c(alpha) is the critical value at a given alpha level. Here, at an alpha of 0.05, c(alpha) = 1.36. Quantitative Proteomics Signature Profiling
β (g i , A ) =
Quantitative proteomics signature profiling (QPSP) assumes that higher abundance proteins are measured more reliably. QPSP first sorts proteins in each sample based on their measured abundance and selects the most abundant proteins above a certain percentile (the value is denoted as alpha1). A second percentile value (defined as alpha2) is set to extend the list of proteins for improving sensitivity while maintaining precision.29 To penalize lower-ranked proteins, we assigned proteins different weights depending on which percentiles they fall under. By default, alpha1 and alpha2 are set as the top 10% and the top 10−20% proteins, respectively. Rank-based weighting is achieved via discretization of the continuous range from the top 10−20% into four bins: 10−12.5% (weight 0.8), 12.5−15% (0.6), 15−17.5% (0.4), and 17.5−20% (0.2). All other proteins outside of alpha2 (viz. the remaining proteins) have a weight of 0 (and are thus ignored). Alpha1 proteins are assigned a full weight of 1. Proteins with nonzero weight are referred to as alpha proteins. For each tissue, a vector of hit-rates is produced by considering the overlaps of the alpha proteins against a vector of complexes. Suppose we have a tissue from class A (SA) and a vector of complexes of length n. For each complex Ci in the complex vector, the hit-rate is the intersection of proteins in SA and Ci, modulated by the proteins’ weights (i.e., an alpha1 protein counts as 1 in the intersection, an alpha2 protein of weight 0.8 counts as 0.8 in the intersection, and so on), over the total number of proteins in Ci. We let the hit-rate for Ci in SA be H(SA, Ci). Therefore, the vector of hit-rates for tissue SA is HSA = ⟨H(SA,C1), ..., H(SA,Cn)⟩. This vector of hit-rates constitutes the tissue’s proteomics signature profile. To perform feature selection (where a feature is a protein complex in this instance), for each complex C in the complex vector, we compare two lists against each other, HA = ⟨H(A1,C), ..., H(Am,C)⟩ and HB = ⟨H(B1,C), ..., H(Bn,C)⟩, where A and B are phenotype classes of sizes m and n, respectively. The t statistic between HA and HB is computed by the standard formula t score =
1 n
+
∑
fs(gi , pk )
pk ∈ A
|A |
That is, β(gi,A) is the proportion of tissues in A that have gi among their top alpha percent most-abundant proteins. Let score(C,pk,A) be the score of a protein complex C and a tissue pk weighted based on the class A. It is defined as score(C , pk , A) =
∑
fs(gi , pk ) × β(gi , A)
gi ∈ C
PFSNET defines a score delta(C,pk,X,Y) for a complex C and tissue pk with respect to classes X and Y as the difference of the score of C and tissue pk weighted based on X from the score of C and tissue pk weighted based on Y. More precisely delta(C , pk , X , Y ) = score(C , pk , X ) − score(C , pk , Y )
If a complex C is irrelevant to the difference between classes X and Y, the value of delta(C,pk,X,Y) is expected to be around 0. Thus, PFSNet defines the following one-sample t statistic fPFSNET (C , X , Y , Z) =
mean(C , X , Y , Z) se(C , X , Y , Z)
where mean(C, X, Y, Z) and se(C, X, Y, Z) are, respectively, the mean and standard error of the list { delta(C,pk,X,Y) | pk is a tissue in Z}. The complex C is considered significantly consistently highly abundant in X but not in Y if f PFSNet(C, X, Y, X ∪ Y) is at the largest 5% extreme of the Student’s t distribution. Given two classes A and B, the set of significant complexes returned by PFSNet is the union of {C | f PFSNet(C,A,B,A ∪ B) is significant} and {C | f PFSNet(C,B,A,A∪ B) is significant}, the former being complexes that are significantly consistently highly abundant in A but not B and vice versa. Extremely Small SubNET (ESSNET)
For a given complex with at least five overlaps (see Tweaking ESSNET on Real Data: Determining a Minimum Overlap Threshold section) with the identified protein list, if the complex is irrelevant to the distinction between phenotypes A (e.g., control) and B (e.g., disease), the paired differences of the expression values of any protein in this complex in any pair of tissues of A and B should theoretically be very small. For a complex of size k, n patients in phenotype A and m patients in phenotype B, there are n × m possible pairs of
HA − HB SHA , HB
(m − 1)SHA 2 + (n − 1)SHB2 m+n−2
1 m
where D
DOI: 10.1021/acs.jproteome.6b00402 J. Proteome Res. XXXX, XXX, XXX−XXX
Article
Journal of Proteome Research differences for each of the k proteins. Thus, if the complex is irrelevant, these k × n × m paired differences should be centered on 0. This may be evaluated statistically using a onesample t test where the t statistic (Ts) is expressed as
Ts =
ESSNET on Real Data: Estimating the Appropriate Degrees-ofFreedom section). Performance Benchmarking on Simulated Data
In simulated data, the significant proteins are known a priori and can be used for constructing significant artificial complexes at various levels of purity (proportion of significant proteins in the complex). Proteins residing in the same complex tend to be expressionally correlated. To incorporate this principle, we calculated a Euclidean distance for all significant protein pairs across all samples. These are then clustered via Ward’s linkage. The significant proteins are reordered such that those with similar expression pattern are adjacent to each other. This reordered list is then split at regular intervals to generate 20 and 101 complexes for D1.2 and D2.2, respectively. An equal number of nonsignificant proteins is randomly selected, reordered based on expressional correlation, and then split to generate an equal number of nonsignificant complexes. We may alter the purity of the simulated complexes by reducing the proportion of significant proteins within them. In practice, we seldom observe all complex members being differentially expressed at the same time, which renders it too easy for detection. Thus, reducing purity permits the evaluation of the robustness and sensitivity of the complex-based analysis methods. Here purity is tested at three levels: 100, 75, and 50%. At 100% purity, simulated complexes are composed solely of significant proteins; at 75% purity, 25% of the constituent significant proteins are randomly replaced with nonsignificant ones and so on. The significant and nonsignificant simulated complexes are combined into a single vector. We may evaluate the network methods based on precision and recall, which are defined as
μ s/ 2 M
where μ and s are the means and standard deviation of the paired differences and M = k × n × m. The p value is determined nominally against a theoretical distribution. The complex is considered relevant if the area above the observed Ts in the nominal distribution is μ), its associated p value can be inferred from the nominal z distribution, where the p value is the area under the curve greater than Z (Figure 6B). As overlap layer increments, the associated p values become more significant, in particular, J
DOI: 10.1021/acs.jproteome.6b00402 J. Proteome Res. XXXX, XXX, XXX−XXX
Article
Journal of Proteome Research
Figure 6. (A) Overlaps of significant complexes at each overlap layer (from 2 to 5) on RC. The layers refer to the minimum protein overlap required between the identified RC proteins and the CORUM complexes. (B) Comparison of number of significant complexes using control data (RCC) and real renal cancer data (RC). The number of significant complexes observed for real data is significantly higher than control data where there are no real class differences. However, the p values become considerably more significant as the overlap level increases from 1 to 5. We estimate the optimal minimum overlap threshold to lie between 4 and 5.
consider biological process (BP) terms because the molecular function (MF) and cellular localization (CL) terms are typically of limited clinical interest. (not all proteins are enzymes, and there are not as many cellular locations as there are biological processes.) Because the GO terms follow a hierarchical relationship, there are high interterm dependencies that obfuscate analysis. To reduce these effects, we filtered the BP list to include only informative GO terms using the following rule: An informative GO term is annotated to at least 30 proteins, and none of its descendant terms are annotated to at least 30 proteins (439 informative GO-BP terms).43 Because ESSNET is the best among the newer set of network-test approaches, we just use this for comparison against HE and DG, which are traditionally used for functional-enrichment analysis. Again (cf. Figure 2A), it is obvious that the replicate agreement rate based on ESSNET is significantly higher than both HE and DG (Table 2). Just as the newer network-test methods are useful for isolating good-quality protein complexes, they can also be used for stably recovering functional terms relevant to the phenotype. Many of the terms reported in DG and HE are also found in ESSNET: Agreement rate between DG and ESSNET is 63%, while it is 70% between ESSNET and HE. HE forms the basis of many popular functional-enrichment tools, for example, GO-TermFinder,44 GO-EAST,45 and DAVID,46 while DG is used as the basis of the popular gene set enrichment analysis.9 Given reproducibility issues, it is advisible to consider carefully the implications when using HE or DG for functional-enrichment testing. There is potential for NP methods to become a new standard for functional enrichment analysis; however, this needs to be examined more thoroughly (GO-term stability and GO-term relevance to the phenotype) in future work.
Table 1. Performance Differences Based on Conservative and Relaxed Degrees of Freedoma DOF 0.5 × k × (n + m)
n+m
data set
sampling size
4
6
8
4
6
8
RC
precision recall
0.97 0.77
0.98 0.87
0.98 0.92
1.00 0.20
1.00 0.38
0.98 0.50
a
We evaluated the precision and recall differences based on samplings of size 4, 6, and 8 on the renal cancer (RC) dataset. Precision is maintained, but recall suffers dramatically when DOF is too conservative. This also implies that proteins in the same complex are not perfectly coregulated.
perfectly coregulated,41 and incorrectly assuming perfect coregulation leads to great loss in sensitivity. Network Approaches as Annotation Strategies
While HE and DG are commonly used as a means of functional annotation, our analyses (simulation and real data) reveal serious stability issues with these methods. Gene Ontology (GO) terms are essentially groups with proteins with same function and thus, analogous to complexes. We may replace complexes with GO terms taken from the GO consortium (http://geneontology.org/)42 and check agreement levels between the technical replicates (Table 2). Note that we only Table 2. Number of Significant GO-Biological Process Terms Per Technical Replicate and Overlaps number of terms
HE
ESSNET
DG
replicate 1 replicate 2 overlaps
9 3 0.2
139 127 0.89
1 9 0.11 K
DOI: 10.1021/acs.jproteome.6b00402 J. Proteome Res. XXXX, XXX, XXX−XXX
Article
Journal of Proteome Research
■
(2) Gillet, L. C.; Navarro, P.; Tate, S.; Rost, H.; Selevsek, N.; Reiter, L.; Bonner, R.; Aebersold, R. Mol. Cell. Proteomics 2012, 11, O111 016717. (3) McAlister, G. C.; Huttlin, E. L.; Haas, W.; Ting, L.; Jedrychowski, M. P.; Rogers, J. C.; Kuhn, K.; Pike, I.; Grothe, R. A.; Blethrow, J. D.; Gygi, S. P. Anal. Chem. 2012, 84, 7469. (4) Zhang, B.; Wang, J.; Wang, X.; Zhu, J.; Liu, Q.; Shi, Z.; Chambers, M. C.; Zimmerman, L. J.; Shaddox, K. F.; Kim, S.; Davies, S. R.; Wang, S.; Wang, P.; Kinsinger, C. R.; Rivers, R. C.; Rodriguez, H.; Townsend, R. R.; Ellis, M. J.; Carr, S. A.; Tabb, D. L.; Coffey, R. J.; Slebos, R. J.; Liebler, D. C.; Nci, C. Nature 2014, 513, 382. (5) Goh, W. W.; Lee, Y. H.; Chung, M.; Wong, L. Proteomics 2012, 12, 550. (6) Halsey, L. G.; Curran-Everett, D.; Vowler, S. L.; Drummond, G. B. Nat. Methods 2015, 12, 179. (7) Venet, D.; Dumont, J. E.; Detours, V. PLoS Comput. Biol. 2011, 7, e1002240. (8) Goh, W. W.; Guo, T.; Aebersold, R.; Wong, L. Biol. Direct 2015, 10, 71. (9) Subramanian, A.; Tamayo, P.; Mootha, V. K.; Mukherjee, S.; Ebert, B. L.; Gillette, M. A.; Paulovich, A.; Pomeroy, S. L.; Golub, T. R.; Lander, E. S.; Mesirov, J. P. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 15545. (10) Ruepp, A.; Brauner, B.; Dunger-Kaltenbach, I.; Frishman, G.; Montrone, C.; Stransky, M.; Waegele, B.; Schmidt, T.; Doudieu, O. N.; Stumpflen, V.; Mewes, H. W. Nucleic Acids Res. 2007, 36, D646. (11) Ruepp, A.; Waegele, B.; Lechner, M.; Brauner, B.; DungerKaltenbach, I.; Fobo, G.; Frishman, G.; Montrone, C.; Mewes, H. W. Nucleic Acids Res. 2010, 38, D497. (12) Ruepp, A.; Waegele, B.; Lechner, M.; Brauner, B.; DungerKaltenbach, I.; Fobo, G.; Frishman, G.; Montrone, C.; Mewes, H. W. Nucleic Acids Res. 2010, 38, D497. (13) Rolland, T.; Tasan, M.; Charloteaux, B.; Pevzner, S. J.; Zhong, Q.; Sahni, N.; Yi, S.; Lemmens, I.; Fontanillo, C.; Mosca, R.; Kamburov, A.; Ghiassian, S. D.; Yang, X.; Ghamsari, L.; Balcha, D.; Begg, B. E.; Braun, P.; Brehme, M.; Broly, M. P.; Carvunis, A. R.; Convery-Zupan, D.; Corominas, R.; Coulombe-Huntington, J.; Dann, E.; Dreze, M.; Dricot, A.; Fan, C.; Franzosa, E.; Gebreab, F.; Gutierrez, B. J.; Hardy, M. F.; Jin, M.; Kang, S.; Kiros, R.; Lin, G. N.; Luck, K.; MacWilliams, A.; Menche, J.; Murray, R. R.; Palagi, A.; Poulin, M. M.; Rambout, X.; Rasla, J.; Reichert, P.; Romero, V.; Ruyssinck, E.; Sahalie, J. M.; Scholz, A.; Shah, A. A.; Sharma, A.; Shen, Y.; Spirohn, K.; Tam, S.; Tejeda, A. O.; Trigg, S. A.; Twizere, J. C.; Vega, K.; Walsh, J.; Cusick, M. E.; Xia, Y.; Barabasi, A. L.; Iakoucheva, L. M.; Aloy, P.; De Las Rivas, J.; Tavernier, J.; Calderwood, M. A.; Hill, D. E.; Hao, T.; Roth, F. P.; Vidal, M. Cell 2014, 159, 1212. (14) Yook, S. H.; Oltvai, Z. N.; Barabasi, A. L. Proteomics 2004, 4, 928. (15) Bensimon, A.; Heck, A. J.; Aebersold, R. Annu. Rev. Biochem. 2012, 81, 379. (16) Goh, W. W.; Wong, L. Curr. Opin. Biotechnol. 2013, 24, 1122. (17) Zhou, H.; Jin, J.; Zhang, H.; Yi, B.; Wozniak, M.; Wong, L. BMC Syst. Biol. 2012, 6 (Suppl 2), S2. (18) Srihari, S.; Yong, C. H.; Patil, A.; Wong, L. FEBS Lett. 2015, 589 (19), 2590. (19) Clancy, T.; Hovig, E. Proteomics 2014, 14, 24. (20) Li, S. S.; Xu, K.; Wilkins, M. R. J. Proteome Res. 2011, 10, 4744. (21) Malovannaya, A.; Lanz, R. B.; Jung, S. Y.; Bulynko, Y.; Le, N. T.; Chan, D. W.; Ding, C.; Shi, Y.; Yucer, N.; Krenciute, G.; Kim, B. J.; Li, C.; Chen, R.; Li, W.; Wang, Y.; O’Malley, B. W.; Qin, J. Cell 2011, 145, 787. (22) Fraser, H. B.; Plotkin, J. B. Genome biology 2007, 8, R252. (23) Goh, W. W.; Sergot, M. J.; Sng, J. C.; Wong, L. J. Proteome Res. 2013, 12, 2116. (24) Goh, W. W.; Wong, L. Trends in Biotechnology 2016, DOI: 10.1016/j.tibtech.2016.05.015. (25) Mewes, H. W.; Amid, C.; Arnold, R.; Frishman, D.; Guldener, U.; Mannhaupt, G.; Munsterkotter, M.; Pagel, P.; Strack, N.;
CONCLUSIONS Comparative analysis of five complex-based paradigms on both simulated and real data reveals that the NP approach, ESSNET, dominates in almost every aspect, from precision-recall in simulated data, to reproducibility and stability on real data. It remains stable in terms of feature-selection stability in small data sets. Additionally, because it does not use any form of abundance thresholds, it can be explicitly used for assaying lowabundance complexes. These qualities suggest that ESSNET is a suitable method for analyzing clinical proteomics data and can play important roles in biomarker and drug-target discovery. PFSNET and ESSNET were initially developed to work with networks and pathways in the genomics context;29,47 however, protein complexes can be regarded as a special case of biological networks and pathways, and they can often be used in lieu of each other. Here, and in previous work,23 we used protein complexes (instead of subnets derived in some way from biological networks) and found that complexes are enriched for biological signal and provides high-quality results in the proteomics context. Nevertheless, the set of known complexes is very incomplete, and to obtain deeper analytical insight and expand the variety of biological functionalities represented therein, it may still be useful to augment the feature vector with high-quality predicted complexes from biological networks.
■
ASSOCIATED CONTENT
S Supporting Information *
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.6b00402. Supplementary Figure 1. Single protein (SP) precisionrecall performance on D1.2. Supplementary Figure 2. Protein abundances of complexes found by both PFSNET and ESSNET and the single missed complex. Supplementary Data 1. List of significant complexes based on its CORUM identifier and accompanying pvalues identified in RC by each method. (PDF)
■
AUTHOR INFORMATION
Corresponding Authors
*W.W.B.G.: E-mail:
[email protected]; goh.informatics@ gmail.com. Tel: +86-22-27401021. *L.W.: E-mail:
[email protected]. Tel: +65-65162902. Author Contributions
W.W.B.G. and L.W. designed the bioinformatics methods and pipelines, performed implementation, and wrote the manuscript. Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS This work is supported by an education grant from Tianjin University, China to W.W.B.G. and a Singapore Ministry of Education tier-2 grant, MOE2012-T2-1-061 to L.W.
■
REFERENCES
(1) Guo, T.; Kouvonen, P.; Koh, C. C.; Gillet, L. C.; Wolski, W. E.; Rost, H. L.; Rosenberger, G.; Collins, B. C.; Blum, L. C.; Gillessen, S.; Joerger, M.; Jochum, W.; Aebersold, R. Nat. Med. 2015, 21, 407. L
DOI: 10.1021/acs.jproteome.6b00402 J. Proteome Res. XXXX, XXX, XXX−XXX
Article
Journal of Proteome Research Stumpflen, V.; Warfsmann, J.; Ruepp, A. Nucleic acids research 2004, 32, 41D. (26) Mewes, H. W.; Frishman, D.; Mayer, K. F.; Munsterkotter, M.; Noubibou, O.; Pagel, P.; Rattei, T.; Oesterheld, M.; Ruepp, A.; Stumpflen, V. Nucleic Acids Res. 2006, 34, D169. (27) Huang, D. W.; Sherman, B. T.; Lempicki, R. A. Nat. Protoc. 2008, 4, 44. (28) Sivachenko, A. Y.; Yuryev, A.; Daraselia, N.; Mazo, I. J. Bioinf. Comput. Biol. 2007, 5, 429. (29) Lim, K.; Wong, L. Bioinformatics 2014, 30, 189. (30) Soh, D.; Dong, D.; Guo, Y.; Wong, L. BMC Bioinf. 2011, 12 (Suppl 13), S15. (31) Lim, K.; Li, Z.; Choi, K. P.; Wong, L. J. Bioinf. Comput. Biol. 2015, 13, 1550018. (32) Khatri, P.; Draghici, S. Bioinformatics 2005, 21, 3587. (33) Pavlidis, P.; Lewis, D. P.; Noble, W. S. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 2002, 474. (34) Goeman, J. J.; van de Geer, S. A.; de Kort, F.; van Houwelingen, H. C. Bioinformatics 2004, 20, 93. (35) Goh, W. W.; Lee, Y. H.; Ramdzan, Z. M.; Sergot, M. J.; Chung, M.; Wong, L. J. Proteome Res. 2012, 11, 1571. (36) Button, K. S.; Ioannidis, J. P.; Mokrysz, C.; Nosek, B. A.; Flint, J.; Robinson, E. S.; Munafo, M. R. Nat. Rev. Neurosci. 2013, 14, 365. (37) Langley, S. R.; Mayr, M. J. Proteomics 2015, 129, 83. (38) Rost, H. L.; Rosenberger, G.; Navarro, P.; Gillet, L.; Miladinovic, S. M.; Schubert, O. T.; Wolski, W.; Collins, B. C.; Malmstrom, J.; Malmstrom, L.; Aebersold, R. Nat. Biotechnol. 2014, 32, 219. (39) Raju, T. N. Pediatrics 2005, 116, 732. (40) Goh, W. W.; Fan, M.; Low, H. S.; Sergot, M.; Wong, L. BMC Genomics 2013, 14, 35. (41) Goh, W. W. B.; Oikawa, H.; Sng, J. C. G.; Sergot, M.; Wong, L. Bioinformatics 2012, 28, 453. (42) Ashburner, M.; Ball, C. A.; Blake, J. A.; Botstein, D.; Butler, H.; Cherry, J. M.; Davis, A. P.; Dolinski, K.; Dwight, S. S.; Eppig, J. T.; Harris, M. A.; Hill, D. P.; Issel-Tarver, L.; Kasarskis, A.; Lewis, S.; Matese, J. C.; Richardson, J. E.; Ringwald, M.; Rubin, G. M.; Sherlock, G. Nat. Genet. 2000, 25, 25. (43) Zhou, X.; Kao, M. C.; Wong, W. H. Proc. Natl. Acad. Sci. U. S. A. 2002, 99, 12783. (44) Boyle, E. I.; Weng, S.; Gollub, J.; Jin, H.; Botstein, D.; Cherry, J. M.; Sherlock, G. Bioinformatics 2004, 20, 3710. (45) Zheng, Q.; Wang, X. J. Nucleic Acids Res. 2008, 36, W358. (46) Dennis, G., Jr.; Sherman, B. T.; Hosack, D. A.; Yang, J.; Gao, W.; Lane, H. C.; Lempicki, R. A. Genome biology 2003, 4, P3. (47) Lim, K.; Li, Z.; Choi, K. P.; Wong, L. J. Bioinf. Comput. Biol. 2015, 13, 1550018. (48) Goh, W. W.; Wong, L. J. Bioinf. Comput. Biol. 2016, 5 (14), 16500293.
M
DOI: 10.1021/acs.jproteome.6b00402 J. Proteome Res. XXXX, XXX, XXX−XXX