A Multidimensional Matrix for Systems Biology Research and Its

Systems Biology Initiative and School of Biotechnology and Biomolecular Sciences, University of New South Wales, New South Wales, Australia. J. Proteo...
0 downloads 0 Views 2MB Size
Article pubs.acs.org/jpr

A Multidimensional Matrix for Systems Biology Research and Its Application to Interaction Networks Chi Nam Ignatius Pang, Apurv Goel, Simone S. Li, and Marc R. Wilkins* Systems Biology Initiative and School of Biotechnology and Biomolecular Sciences, The University of New South Wales, New South Wales, Australia S Supporting Information *

ABSTRACT: A multidimensional matrix containing 76 parameters from 21 transcriptomics, proteomics, interactomics, phenotypic and sequence-based data sets, in which each data set covered most of the Saccharomyces cerevisiae proteome, was compiled for systems biology research. The maximal information coefficient (MIC) was used to measure correlations between every pair of parameters. Out of 2850 possible comparisons, 340 pairs of variables (12%) showed statistically significant MIC scores. There were 321 relationships that were expected; these included relationships within physicochemical parameters of proteins, between abundance levels of genes/proteins and expression noise, and between different types of intracellular networks. We found 19 potentially novel relationships between different types of “-omics” data. The strongest of these involved genetic interaction networks, which were correlated with pleiotropy and cell-to-cell variability in protein expression. Protein disorder also showed a number of significant relationships with protein abundance, signaling and regulatory networks. Significant crosstalk was seen between the signaling and kinase interaction networks. Investigation of this revealed densely connected kinase clusters and significant signaling between them, along with signaling centers that act as integrators or broadcasters of intracellular information. These centers may allow for redundancy and a means of dampening noise in networks under a variety of genetic or environmental perturbations. KEYWORDS: maximal information coefficient, protein interaction networks, signaling, systems biology



associations in many large “-omics” data sets in a systematic manner.26 A major benefit of correlation analyses is that it provides a global view on how pairs of parameters from a large data set can be related to each other. These are measured consistently using a common correlation score, which facilities comparisons across different data sets. A recently developed method, called the maximal information coefficient (MIC),26 allows pairwise correlation to be analyzed systematically across large data sets. The concept behind MIC is based on drawing a grid on the scatter plot of two variables and calculating the sum of the mutual information in each cell as a measure of the relationship between the two variables. For each x-by-y grid, MIC finds the best partition that provides the highest mutual information score, which is recorded in the corresponding (x, y) entry of a result matrix. The previous step is repeated for all grid dimensions up to a predefined resolution based on the sample size. These scores in the result matrix are normalized to values within 0 to 1, and the maximum score in the result matrix is the MIC score. The MIC is more “general” than Pearson correlation as it can measure more extensive types of correlations, including linear, nonlinear, exponential, sinusoidal,

INTRODUCTION In recent years, there has been rapid accumulation of “-omics” data for Saccharomyces cerevisiae. This has allowed systems-level models of this lower-eukaryotic organism to be created. Analysis of these data has provided insights on the regulation of protein expression, the dynamics of biological networks, and of their evolution. First, integration of gene and protein expression data has led to understanding the steady-state of protein expression.1−4 In particular, that steady-state protein abundance is regulated by both sequence-based and posttranslational mechanisms.5−8 Second, integrative analysis of phenotypic, transcriptomics, and proteomics data with biological networks has resulted in important insights on network dynamics.9−13 Proteins that are dynamically expressed are involved in functions such as cell-cycle complexes and signaling networks, which are required for the cell to adapt to different environmental conditions.9−12,14 In addition, analysis of genetic interaction networks has helped decipher the phenotypic landscapes of the cell.15,16 Third, the integration of interactome and interspecies homology data allows us to understand the evolution of biological networks.11,17−19 In the above cases, interpretation of these networks were facilitated by visualization tools (e.g., see refs 20−24). While it is possible to analyze several data sets coherently, such as the analysis of 16 “-omics” parameters,25 it has been a challenge to detect novel © 2012 American Chemical Society

Received: May 1, 2012 Published: September 14, 2012 5204

dx.doi.org/10.1021/pr300405y | J. Proteome Res. 2012, 11, 5204−5220

Journal of Proteome Research

Article

(2007)28 used multidimensional protein identification technology (MudPIT), tandem mass spectrometry, and calculation of protein abundance using observed peptide counts to provide absolute protein expression (APEX) quantification. Protein abundance is measured for cells grown in rich medium and minimal medium, and they are labeled “abundance_apex_rich_lu_2006” and “abundance_apex_minimal_lu_2006”, respectively. Newman et al. (2006)4 used GFP-tagged proteins and flow cytometry to measure protein abundance with single-cell resolution. The average protein abundance from a population of cells grown in either minimal or rich media are labeled as “avg_abundance_rich_newman_2006” and “avg_abundance_minimal_newman_2006”, respectively. de Godoy et al. (2008)29 used stable isotope labeling by amino acids in cell culture (SILAC) to measure the change in protein abundance in haploid and diploid yeast. The total peptide intensities for both the haploid and diploid experiments were used as a proxy for the protein abundance as described previously.29 This data set is labeled as “abundance_de_godoy_2008”. Protein half-lives (minutes) from Belle et al. (2006)1 were generated using TAP-tag, TAP-specific anticalmodulin binding protein antibody and Western blot. Abundance is measured at various time points starting from the time when protein synthesis is inhibited by cycloheximide. It is labeled as “half_life_belle_2006” in our data set. Cell-to-Cell Variation in Protein Expression. Newman et al. (2006)4 used GFP-tagged proteins and flow cytometry to measure protein abundance with single-cell resolution. This enabled quantitative measurement of protein expression variance (or noise) among a population of cells, represented as coefficient of variation (CV). Measurements are made in cells grown in rich or minimal media, and these values are labeled as “avg_noise_rich_newman_2006” and “avg_noise_minimal_newman_2006” in our data set. The CV values are also normalized by calculating the distance to the median CV values. These values are represented as “norm_noise_rich_newman_2006” and “norm_noise_minimal_newman_2006” for data collected from cells grown in rich or minimal media, respectively. Protein Phosphorylation. The number of known phosphorylation sites for each protein was sourced from Uniprot30 (version 2011_08). It is labeled as “num_phosphorylation_sites_uniprot” in the data set.

elliptical relationships, or a superposition of these relationships. MIC is also more “equitable” in detecting signal from noise since it provides similar scores for data that contain equally noisy relationships. While Pearson correlation ranges from −1 for negative correlation to 1 for positive correlation, MIC ranges from 0 for no correlation to 1 for strong correlation. Therefore, while MIC can measure the strength of correlations in data, additional analysis is required to determine the nature of the relationship between the pairs of data. The aim of this project is to systematically explore novel relationships between multiple “-omics” data sets of S. cerevisiae. Our data matrix includes a selection of transcriptomics, proteomics, protein interaction networks, kinase−substrate interaction networks, genetic interaction networks, phenotypic and sequence-based data. Interactome network data are summarized as network parameters, such as degree and centrality score for each gene or protein. The analyses include 76 parameters from 21 “-omics” data sets, and each parameter can be considered to be an additional dimension in our data matrix or data-cube. The MIC correlation score are calculated for every pair of parameters, resulting in a total of 2850 pairwise correlations. By limiting the false discovery rate to 5%, we confirmed previously known associations in “-omics” data sets and also discovered novel associations.



MATERIALS AND METHODS

Transcriptomics Data

The data generated by Beyer et al. (2004)27 contains three sets of parameters on protein post-transcriptional expression regulation, generated using microarray and ribosomal profiling techniques. The first parameter “copy_number_beyer_2004” reports the average copies of mRNA per gene. The second parameter “ribo_density_beyer_2004” reports the density of ribosome per mRNA that is undergoing translation. It is equal to the length of the mRNA product divided by the average number of ribosome for the corresponding mRNA represented as a percentage. The third parameter “occupancy_beyer_2004”, measures ribosome occupancy, which is the fraction of the mRNA transcripts per gene engaged in translation. Using polysome profiling and deep sequencing of ribosomeprotected mRNA fragments, Ingolia et al. (2009)3 is able to provide the most accurate large-scale data to-date on mRNA copy number and protein translation rate. This data set provided normalized next-generation sequence data represented as reads per kilobase of exon model per million mapped reads (rpkM). The mRNA copy numbers are measured for cells grown in either rich medium or medium without amino acids, labeled as “copy_number_rich_ingolia_2009” and “copy_number_noaa_ingolia_2009”, respectively in our data set. Similarly, the translation rates are also measured for cells grown in either rich medium or medium without amino acids, and they are included in our data set as “footprinting_rich_ingolia_2009” and “footprinting_noaa_ingolia_2009”, respectively. Average values are reported for genes with replicate results whenever it is possible, or the single measurement is reported if replicate measurements are not available.

Phenotypic Data

Gene Deletion. The data from Giaever et al. (2002)31 represents yeast gene essentiality (essentiality_giaever_2002). It is a binary variable, in which 0 indicates viable phenotype when the gene is deleted, and 1 indicates inviable phenotype when the gene is deleted. Gene Overexpression. The data from Sopko et al. (2006)32 measures whether the protein causes deleterious phenotype when it is overexpressed (overexpression_severity_sopko_2006). This parameter ranges from 1 to 5, representing severe toxic effect (1) to no phenotypic effect on overexpression (5). Pleiotropy. The data generated by Dudley et al. (2005)33 represents the number of phenotypes associated with a mutant with the loss of a single gene (pleiotropy_dudley_2005). The degree of pleiotropy for each gene was measured by exposing 4710 single deletion mutations to 21 environmental conditions.

Proteomics Data

Protein Abundance. Protein abundance (copies/cell) from Ghememaghami et al. (2003)2 was generated using TAP-tag and quantitation with Western blot. It is labeled in our data set as “abundance_ghememaghami_2003”. Lu et al. 5205

dx.doi.org/10.1021/pr300405y | J. Proteome Res. 2012, 11, 5204−5220

Journal of Proteome Research

Article

microarray chip (ChIP-chip). This data set is labeled as “lee_2002”.

Phenotypes that were observed in biological duplicates were included the data set. A number represents the number of environmental conditions that leads to a significant deleterious growth phenotype for the mutant.

List of Network Parameters

Various network parameters that summarize the property of this network are included in our data set. The network parameters are calculated using the igraph package in R.47 The label for each parameter is composed of the network label, followed by a “_” and the name of the network parameter. For example, the list of node degree (degree) for the kinase interaction networks (kpi) is labeled as “kpi_degree” in our data set. The list of network parameters is described below. The “degree” of a node is the total number of interaction partners of this node. For the transcription networks and kinase−substrate networks, which are both directed networks, the degree can be further subcategorized as “in degree” or “out degree”. The “in degree” is the number of interaction partners in which the interactions are directed toward this node. For the kinase−substrate network, the in degree of a protein is the number of kinases which can phosphorylate the protein. The in degree in the context of the transcriptional regulatory network is the number transcription factors that can activate expression of the relevant gene. The “out degree” is the number of interaction partners in which interaction is directed away from the node. For the kinase−substrate network, the out degree of a kinase is the number of protein substrates it can phosphorylate. In the context of the transcriptional regulatory networks, the out degree is the number of genes in which the corresponding promoter can be bound by the transcription factor. The “betweenness” for a node is the total number of shortest paths that passes through this node. This value takes into account all possible shortest paths between each pair of nodes in the network. The “closeness” parameter measures of how many steps are required to access every other node from the relevant node. It represents the inverse of the average length of the shortest paths to reach all other nodes. For two nodes found in disconnected components of the network, the total number of nodes in the network was used as a proxy for the shortest path. When measured in the context of a directed network, the closeness for inward and outward directed paths are referred to as “closeness in” and “closeness out” correspondingly. The “clustering coefficient” parameter measures of the degree of connectivity between immediate binding partners of the relevant node. It is the number of interactions between the immediate neighbors of the node, represented as a fraction of the total possible number of interactions among the immediate neighbors. It is also called the clustering coefficient. The “eigenvector centrality” parameter measures whether the node is central to the network. The formulation of this parameter is beyond the scope of this manuscript but is described in detail in the “igraph” manual.47,48 The “page rank” parameter was formulated by Google founders Sergey Brin and Larry Page in 1998.49 It is a heuristics for ranking the importance and centrality of each node in a network.

Sequence-Based Data

Protein Domains. The number of Pfam domains for each protein (num_domains_pfam) is obtained from the Pfam database release 20.34 Protein Disorder. The percentage of residues in a protein that is in an intrinsically disordered region from Kim et al. (2008)35 is labeled in the data set as “percent_disorder_kim_2008”. Other Sequence-Based Data. A number of parameters are downloaded from the Saccharomyces Genome Database (SGD) (http://www.yeastgenome.org/download-data/) on 15 September, 2011.36 These includes protein molecular weight (molecular_weight_sgd), protein length (protein_length_sgd), and protein isoelectric point (pi_sgd), grand average of hydropathicity score (gravy_sgd),37 and aromaticity (aromaticity_sgd), which is the frequency of aromatic amino acids, Phe, Tyr, and Trp in the protein. The data set also includes the codon adaptation index (cai_sgd)38 and the frequency of optimal codons index (fop_score_sgd),39 which are measures that estimate protein abundance. Interaction Networks Data

Protein Complexes. Hart et al. (2007)40 generated a list of protein complexes, and the parameter “is_in_complex_hart_2007” indicates whether the protein belongs to a protein complex. It is a binary variable, in which 1 indicates the protein belongs to a protein complex, and 0 indicates the protein does not belong to a complex. The high-quality protein interaction network, in which every protein−protein interaction is verified by two or more experiments methods, was curated as a part of this study. It was curated using the methods described by Han et al. (2004)41 and Bertin et al. (2007)42 with minor modifications described in the Supporting Information, Methods. This network is labeled as “sbi” in the data set. The list of interactions for this network is available as Supporting Information (Table S1). The kinase interaction networks were built by combining the data from Breitkreutz et al. (2010)43 and Fasolo et al. (2011).44 This network is labeled as “kpi” (kinase protein−protein interactions) in the data set. The method for constructing this network is described in the Supporting Information, Methods. The list of interactions for this network is available as Supporting Information (Table S2). Genetic Interaction Networks. Costanzo et al. (2010)15 measured the fitness of double gene-deletion S. cerevisiae mutants and generated the stringent genetic interaction networks (costanzo_stringent_2010) used in this study. Similarly, Fiedler et al. (2009)16 generated the genetic interaction networks (fiedler_2009), which include all known kinases and phosphatases of proteins and small-molecules and proteins that played a key role in cellular regulation. Kinase−Substrate Networks. The network was generated by Ptacek et al. (2005)45 using proteome chips incubated with the addition of [γ-33P] ATP substrate and the kinase for the specific study. This data set is labeled as “ptacek_2005”. Transcriptional Regulatory Networks. This network was generated by Lee et al. (2002)46 and contains data on the transcriptional regulatory networks. The data is collected using chromatin immunoprecipitation (ChIP) and analyses of

Calculation of the Maximal Information Coefficient

The maximal information coefficient (MIC)26 was used to investigate the relationships between 76 “-omics” parameters described above. Each parameter is represented by a list of genes (or proteins) in which each gene (or protein) is attached 5206

dx.doi.org/10.1021/pr300405y | J. Proteome Res. 2012, 11, 5204−5220

Journal of Proteome Research

Article

substrate interactions than expected by chance between a pair of clusters in comparison to 10 000 randomized networks. A pvalue of less than 0.01 defines the intercluster phosphorylation as significant. The randomization was performed by shuffling the proteins in each cluster but keeping the size of the clusters and the kinase−substrate interaction networks constant. For the intercluster signaling to be included in the significant network, the kinases in the first cluster must phosphorylate at least four proteins in another cluster, and this pair of clusters must be observed at least 2000 times among the 10 000 random networks to ensure high statistical power. The above was repeated with different versions of the clusters created with a list of inflation parameters from 1.4 to 5 with an increment of 0.2. The inflation value of 1.6 maximized the number of kinase−substrate interactions between the clusters, and therefore this value is the optimal value and is used for this study. The significant intercluster signaling network was visualized using the GEOMI graph visualization tool.23,51 Gene Ontology Enrichment of Each Cluster in the Kinase Interaction Networks. The cellular compartment and biological processes of each cluster were analyzed for overrepresented Gene Ontology (GO) terms.53 Several R packages50 were used to search for enriched GO terms. These includes GOstats,54 GSEABase, org.Sc.sgd.db, and GO.db.55 Benjamini−Hochberg false discovery rate correction was calculated using the “multtest” package. GO terms with Benjamini−Hochberg adjusted p-value of less than 0.05 is considered to be significantly over-represented.

to a numerical value. The MIC score represents the correlation between a pair of parameters. For example, correlation is calculated between the mRNA copy numbers for a list of genes and the protein abundances of the corresponding proteins. When comparing each pair of parameters, genes or proteins with missing values were not included in the MIC calculation. The collection of MIC correlation scores for all pairs of value is called the “yeast matrix”. The program for calculating the MIC was obtained from the paper's accompanying Web site (http:// www.exploredata.net/Downloads/MINE-Application). The calculations were performed using the Java program, accessed using the R50 wrapper for MINE. The correlations were calculated using default parameters, in which the exponent variable was set to 0.6, and the number of clumps factor set to 15. The p-value tables, which convert the MIC score and the number of data points to a p-value, were obtained from the paper's accompanying Web site (http://www.exploredata.net/ Downloads/P-Value-Tables). The closest entry would be used to provide an approximate p-value if there was no exact entry in the p-value table. The Benjamini−Hochberg false discovery rate correction was applied to all p-values to limit the false positive rate to 5%. This means only relationships with corrected pvalue of less than 0.05 were considered to be statistically significant, and the corrected p-value is used for the remainder of the manuscript. The Spearman correlations were also calculated for the data sets using R to allow comparison with the MIC correlation score. The R statistical software package was used to create the heat maps, the scatter plots, and to calculate the locally weighted scatter plot smoothing (lowess) regression line.50 The graph visualization tool GEOMI23,51 was used to visualize statistically significant MIC correlations (corrected p-value