Article pubs. acs. org/crt
Discovering Functional Modules by Topic Modeling RNA-Seq Based Toxicogenomic Data Ke Yu, Binsheng Gong, Mikyung Lee, Zhichao Liu, Joshua Xu, Roger Perkins, and Weida Tong* Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, Arkansas 72079, United States S Supporting Information *
ABSTRACT: Toxicogenomics (TGx) endeavors to elucidate the underlying molecular mechanisms through exploring gene expression profiles in response to toxic substances. Recently, RNA-Seq is increasingly regarded as a more powerful alternative to microarrays in TGx studies. However, realizing RNASeq’s full potential requires novel approaches to extracting information from the complex TGx data. Considering read counts as the number of times a word occurs in a document, gene expression profiles from RNA-Seq are analogous to a word by document matrix used in text mining. Topic modeling aiming at to discover the latent structures in text corpora would be helpful to explore RNASeq based TGx data. In this study, topic modeling was applied on a typical RNA-Seq based TGx data set to discover hidden functional modules. The RNASeq based gene expression profiles were transformed into “documents”, on which latent Dirichlet allocation (LDA) was used to build a topic model. We found samples treated by the compounds with the same modes of actions (MoAs) could be clustered based on topic similarities. The topic most relevant to each cluster was identified as a “marker” topic, which was interpreted by gene enrichment analysis with MoAs then confirmed by compound and pathways associations mined from literature. To further validate the “marker” topics, we tested topic transferability from RNA-Seq to microarrays. The RNA-Seq based gene expression profile of a topic specifically associated with peroxisome proliferator-activated receptors (PPAR) signaling pathway was used to query samples with similar expression profiles in two different microarray data sets, yielding accuracy of about 85%. This proof-of-concept study demonstrates the applicability of topic modeling to discover functional modules in RNA-Seq data and suggests a valuable computational tool for leveraging information within TGx data in RNA-Seq era.
■
INTRODUCTION One of the major applications of toxicogenomics (TGx) in toxicology is gaining understanding of molecular mechanisms of toxicity through investigating the gene expression profiles altered by toxic substances.1 Both the nature of TGx data and the purpose of TGx studies require bioinformatics tools for data storage, analysis, and interpretation. Since RNA-Seq was introduced,2−4 many advantages of this transcriptomic technology have been postulated and many demonstrated and promoted.4,5 Nowadays, RNA-Seq is regarded as a promising alternative to microarrays in TGx studies. Accordingly, bioinformatics tools to support RNA-Seq based TGx studies should be prepared for the prospective transition from microarrays to RNA-Seq. While the approaches developed for the analysis of microarray data will continue to have a significant role in this field, it is certainly desirable to examine whether other types of tools can enable extraction of the inherently greater knowledge in RNA-Seq data than in microarray expression data. Gene expression profiles from RNA-Seq can be considered as a matrix, where rows correspond to samples, columns correspond to genes, and the discrete number in each cell corresponds to mapped read counts per gene. If the genes in RNA-Seq were considered to be analogous to unique words, the read counts of © XXXX American Chemical Society
genes for each sample would be analogous to the prevalence of words in a document. Thus, the structure of RNA-Seq data is similar to that of a word by document (word−document) matrix, where rows correspond to documents, columns correspond to unique words, and the discrete number in each cell is the number of times a word occurs in that document. In addition, the reads in RNA-Seq are independently sampled from a population with fixed fractions of genes and such that the detected read counts will follow a multinomial distribution.6 Thus, the distribution of read counts in RNA-Seq data is also similar to that of the word distributions in a document collection.7 The similarities between genes and words and between read counts and word counts suggest topic modeling approaches that are widely used for text mining in document collections7,8 will be analogously valuable for knowledge mining of RNA-Seq data. Topic modeling explicitly assumes that each document is a mixture of topics, and each topic is a mixture of the associated words. By analogy, the RNA-Seq based gene expression profile for each sample can be considered as a mixture of functional modules (e.g., pathways), while each module is a mixture of Received: April 24, 2014
A
dx.doi.org/10.1021/tx500148n | Chem. Res. Toxicol. XXXX, XXX, XXX−XXX
Chemical Research in Toxicology
Article
similarity of gene expression profiles for the replicate rats are significantly higher for the 50% highly expressed genes than for the remaining 50%. Additionally, 50 genes with low expression variance were also removed from modeling. Removing the low expression variance genes is deemed advantageous because they do not provide differential information. Such genes are analogous to stop-words (e.g., the, a, of, for, it, is, etc.) in textual documents which are typically filtered out. After removal of low expression and low expression variance genes, a total of 11122 genes remained. By analogy to textual topic modeling, each sample’s expression profile constitutes a document, each gene constitutes a unique word, and each gene’s read count corresponds to the number of times a word occurs in a document. Topic modeling produces two matrices, topic distributions for the documents and word distributions for the topics. Topic distributions of samples from the duplicated experiments (i.e., the same compound) were averaged, and the Jensen−Shannon (J-S) divergence was used to calculate the compound similarities based on topic distributions with the imposed constraint that each be composed of at least three compounds. The “marker” topic most relevant to the compounds in each cluster was defined to be topic j, for which the ratio of its mean probability values θj for the compounds in that cluster to the sum of θj for the compounds in all other clusters was greatest.13 Model Selection. A LDA model is conditioned on three key parameters, i.e., the Dirichlet hyperparameters alpha (α) and beta (β), and the number of topics (T). Alpha and beta determine the sparsity of per-document topic distribution and per-topic word distribution, respectively. From a probabilistic standpoint, a smaller α (e.g., α = 0.1) results in each document to be more probabilistically associated with fewer topics. Likewise, a smaller β (e.g., β = 0.01) leads to a topic to be more probabilistically associated with fewer words. Given values of α and β, the optimal T can be estimated based on the model’s fitness to reproduce the data from which it was derived. In this study, we used perplexity,14 a widely used metric for predictive quality of a language model, to indicate the model goodness of fit. The model with the lowest perplexity was considered to have the best fitness, that is, the highest likelihood given the data. Gene Enrichment Analysis. Database for Annotation, Visualization, and Integrated Discovery (DAVID) (http://david.abcc.ncifcrf. gov/)15 was used for the gene enrichment analysis to identify the related pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database for the 500 genes with the highest probability association with each “marker” topic. To confirm the specificity of the 500 high probability genes in each “marker” topic, we compared the results with 500 genes randomly sampled from the RNA-Seq gene list (containing 11122 unique genes) for gene enrichment analysis, and this random sampling was conducted for 100 times to avoid sampling bias. “Marker” topics were confirmed on the basis of compound and MoAs associations reported in the literature. Validation of the “Marker” Topics. On the basis of the conjecture that MoAs for a compound should be observed in the gene expression profiles of the corresponding samples measured by both RNA-Seq and microarrays, we tested the transferability of the “marker” topics from RNA-Seq to microarrays for further validation. RNA-Seq profile of a “marker” topic was used as one vector, and the cosine similarity was computed for “marker” topics versus vectors made up of the expression profiles of the same genes from each of the two microarray data sets. The “marker” topic profile from RNA-Seq was for the first 500 highly associated genes averaged across multiple samples in one cluster. Because one compound can generate multiple samples by different dose and different duration time, we defined the queried compound in microarray data sets as that more than three samples treated by the same compound were queried by the “marker” topic. The MoAs for the queried compounds from two microarray data sets were verified by the MoAs reported in the literature or through the gene enrichment analysis for the corresponding microarray DEGs.
associated genes. In this proof-of-concept study, topic modeling was performed to explore a typical RNA-Seq data set. We hypothesized that the functional modules associated with modes of actions (MoAs) could be implicated by the discovered “topics” from the “documents” transformed from RNA-Seq profiles. We found that samples treated by the compounds with the same MoAs could be clustered based on topic similarities. The “marker” topic, which is the most relevant topic to indicate the MoAs for compounds in each cluster, was first interpreted by gene enrichment analysis and then confirmed with knowledge from literature. The aim of this study was to demonstrate the applicability of topic modeling to leverage information within TGx data in RNA-Seq era.
■
MATERIALS AND METHODS
RNA-Seq Based TGx Data. The RNA-Seq data were obtained from the U.S. Food and Drug Administration’s (FDA’s) sequencing quality control (SEQC) study. The details about animal care, platform, toxicological experiment, RNA sample extraction, RNA-Seq library preparation, and sequencing were documented in the publication of the SEQC study.9 In this study, we used TGx data from 36 rat liver samples treated by 12 compounds (most are drugs) from three different MoAs. The RNA-Seq data were generated by Illumina HiScan (San Diego, USA). Sample information was shown in Supporting Information, Table S1. Data were preprocessed by a National Center for Toxicological Research (NCTR) bioinformatics pipeline for the mapping and quantification.9 Briefly, the mapping and quantification were performed by Novoalign v2.08.01 (www.novocraft.com) against rat genome rn4 from UCSC (https://genome.ucsc.edu/). The processed RNA-Seq data normalized by read per million (RPM) were used to build the topic model. Both RPM normalized data and log2 scaled RPM data are available as supplementary data in the Supporting Information. Two Microarray Data Sets. The first microarray data set is from the first phase of the Japanese toxicogenomics project (TGP1) available from TG-GATEs (http://toxico.nibio.go.jp/english/index.html),10 where the expression measurement platform was Affymetrix GeneChip Rat Genome 230 2.0 Array. The microarray data were from the rat liver samples in an in vivo repeat-dose study with multiple duration and multiple dose levels that contained a total of 6249 samples. The Japanese consortium conducted a second phase (TGP2), providing data for verifying biomarkers that were not used in this study. The second microarray data set is from DrugMatrix that is a comprehensive TGx database and publically accessible by ftp://anonftp. niehs.nih.gov/drugmatrix/. In this study, the microarray data were from the rat liver samples in an in vivo repeat-dose study profiled by Affymetrix GeneChip Rat Genome 230 2.0 Array, where 200 compounds were tested with a total of 2218 samples. The original TGP and DrugMatrix microarray data were both normalized by MAS5. The probe sets were mapped to gene symbols using a R routine from Bioconductor (http://bioconductor.org/) named rat2302. db.11 Differential expression gene (DEG) was calculated by comparing the expression levels of the treated samples against those from the corresponding control samples. The student’s t test was performed to evaluate expression fold change significance. Finally, significant DEGs were selected as those simultaneously meeting two conditions, fold change greater than 1.5, and p-value in t test less than 0.05. Topic Modeling. Latent Dirichlet allocation (LDA) was used in this study, and MALLET, 12 a Java topic modeling package, was implemented. Noise filtering is a prerequisite for achieving a reliable topic model and is expected to provide signatures capable of more truly elucidating gene pathways involved in the differential response. According to the results of SEQC study,9 50% of genes with the lowest expression were removed prior to modeling. The SEQC study revealed several findings that justify such removal. First, at least 95% of the total reads in RNA-Seq were observed to appear in the top 50% highly expressed genes. Second, the 50% highly expressed genes had much higher reproducibility than the 50% lowly expressed genes. Finally, B
dx.doi.org/10.1021/tx500148n | Chem. Res. Toxicol. XXXX, XXX, XXX−XXX
Chemical Research in Toxicology
Article
Figure 1. Perplexity of the topic models for different settings of number of topics (T) with different Dirichlet hyperparameters α (a) and β (b).
Figure 2. Sample similarities based on the topic distributions of samples. PPAR, peroxisome proliferator-activated receptors; ER, estrogen receptor; HMGCoA, 3-hydroxy-3-methylglutaryl-CoA; PIR, pirinixic acid; BEZ, bezafibrate; NAF, nafenopin; CFA, clofibric acid; ROS, rosiglitazone; GEM, gemfibrozil; NOR, norethindrone; EES, ethinyl estradiol; BES, β-estradiol; CER, cerivastatin; SIM, simvastatin; LOV, lovastatin. The values in cells are similarity values calculated based on Jensen−Shannon divergence among the topic distributions.
Figure 3. Determining the “marker” topic for the samples in each category. (a) Values in the cells are calculated according to the definition of “marker” topics given in the Materials and Methods section; (b) topic with the highest value is selected as the “marker” topic for a category; (c) identified KEGG pathways using gene enrichment analysis of the associated gene in each topic.
■
RESULTS Determining an Optimal Topic Model. Results of sensitivity studies to choose topic model parameters are shown in Figure 1. The model perplexity on the RNA-Seq data set was taken as a surrogate for model quality and robustness, which is a common practice in topic modeling. A lower perplexity corresponding to a better fit in that the model is less perplexed by the data. Hyperparameter α controls the shape of the per sample topic multinomial distribution of the Dirichlet prior. Figure 1a shows that a desirable low perplexity is obtained for α equal to 0.1. Hyperparameter β controls the shape of the per
topic gene multinomial distribution of the Dirichlet prior. Figure 1b shows that β in the range of 0.01−0.1 yields a somewhat lower perplexity. Thus, an α of 0.1 and β of 0.01 was deemed preferable on the basis of model perplexity. Figure 1 also shows that perplexity increases with increasing number of topics (T), which is indicative of overfitting. The number of topics was chosen to be 10 in order to minimize overfitting while having enough topics to maintain sufficient specificity to differentiate RNA-Seq samples. Clustering Samples by Topic Similarities. The topic model was built with RNA-Seq data for the 36 rat liver samples described in Material and Methods. The topic model yields C
dx.doi.org/10.1021/tx500148n | Chem. Res. Toxicol. XXXX, XXX, XXX−XXX
Chemical Research in Toxicology
Article
Implicating the “Marker” Topics. Activating PPARs induces the transcription of a number of genes facilitating lipid metabolism. For 43 genes regulated by PPARs in PPAR signaling pathway, 23 genes were clustered in topic-3 (i.e., the “marker” topic for PPARα agonists). Specifically, HMGCS2 in topic-3 is the target gene for ketogenesis; SCD-1, ME1, and FADS2 in topic-3 are the target genes for lipogenesis; CYP7A1, CYP8B1, CYP27, and LXRα in topic-3 are the target genes for cholesterol metabolism; Bien, CYP4A1, thiolase B, SCP-X, ACO, CPT-2, LCAD, and MCAD in topic-3 are eight target genes for fatty acid oxidation. Ketogenesis, lipogenesis, cholesterol metabolism, and fatty acid oxidation are responsible for the lipid metabolism. Thus, the association of topic-3 with the PPAR signaling pathway is implicated. In addition, fatty acid metabolism identified in topic-3 is also involved in lipid metabolism. Estrogen response elements (EREs) are associated with a number of coagulation factors.18 It has been demonstrated that near-consensus EREs occur in many of the genes involving in the procoagulant and anticoagulant pathways.19 In addition, regulating the gene expression levels mediated by ERs for the enzymes involving in hepatic metabolism has been well studied.20 Furthermore, liver is the main organ responsible for producing coagulation factors and drug metabolism. Thus, the associations of topic-1 with the pathways including compliment and coagulation cascades and drug metabolism are reasonable. Statins are for lowering cholesterol levels by inhibiting HMGCoA reductase, which plays a central role in the production of cholesterol in the liver. In addition, statins can enhance the biosynthesis of unsaturated fatty acids,21 and replacing saturated fatty acids with unsaturated fatty acids helps to lower cholesterol levels. Moreover, activating PPARs can promote the cholesterol removal,22 and some statins have been observed as the PPARs agonists.23−25 Thus, the associations of topic-8 with the pathways including biosynthesis of unsaturated fatty acids and PPAR signaling pathway are explainable. Applying the “Marker” Topic to Query Microarray Data Sets. We observed that some compounds (e.g., WY-14643, fenofibrate, and gemfibrozil) having MoAs associated with PPAR signaling pathway were contained in both TGP and DrugMatrix microarray data sets. Topic-3 indicating the PPAR signaling pathway was used to query, based on the cosine similarity, the corresponding samples in the microarray data sets. As shown in Table 2, a total of 91 samples in TGP microarray data set from 13 compounds undergoing different treatment (i.e., three levels of doses and four levels of duration times) were identified by topic3. Samples from all different treatments of WY-14643 were identified, lending confidence that WY-14643 was the compound involved in PPAR signaling pathway. The same held for fenofibrate and gemfibrozil. Of these 13 identified compounds, eight (61.5%) compounds (i.e., WY-14643, fenofibrate, gemfibrozil, clofibrate, benzbromarone, simvastatin, aspirin, and amiodarone) were verified by literature to be associated with PPAR signaling pathway, and four (30.8%) compounds (i.e., benziodarone, terbinafine, bendazac, and chloramphenicol) were verified by gene enrichment analysis through their microarray DEGs, leaving only acetaminophen unconfirmed (Figure 3a). This query process was also carried out for the DrugMatrix microarray data set. Table 3 gives results for 79 identified samples treated by 18 different compounds. Among these identified compounds, 14 (77.8%) (i.e., fenofibrate, gemfibrozil, bezafibrate, lovastatin, methyl salicylate, valproic acid, pirinixic acid, pioglitazone, nafenopin, ibuprofen, dexamethasone, clofibric acid, clofibrate, and aspirin) were verified to be involved in the
probabilistic associations of 10 topics for each sample. Distributions of the discovered first five topics for all 36 samples were shown in Supporting Information, Table S2. J-S divergence similarity was computed between the topic distributions among all samples as shown in Figure 2. The samples clustered into three distinct categories in accordance with known MoAs of the sample treatment compounds (see sample information given in Supporting Information, Table S1). All peroxisome proliferatoractivated receptor α (PPARα) agonists, including pirinixic acid, bezafibrate, nafenopin, clofibric acid, and gemfibrozil, clustered in the first category. Rosiglitazone, a selective ligand of PPARγ but having no PPARα-binding action,16 did not cluster in this category. Estrogen receptor (ER) ligands, including norethindrone, ethinyl estradiol, and β-estradiol, clustered in the second category. 3-Hydroxy-3-methyl-glutaryl-CoA (HMG-CoA) reductase inhibitors,17 including cerivastatin, lovastatin, and simvastatin, clustered in the third category. Identifying the “Marker” Topics. Figure 3 shows the identified “marker” topics for compounds in three categories. By means of gene enrichment analysis of the first-500 highly associated genes in each “marker” topic, two significant pathways identified with KEGG were listed (Table 1). Topic-3, the Table 1. KEGG Pathways Identified by the Associated Genes in the “Marker” Topicsa no.
term
KEGG Pathways in Topic-3 1 PPAR signaling pathway 2 fatty acid metabolism KEGG Pathways in Topic-1 1 complement and coagulation cascades 2 drug metabolism KEGG Pathways in Topic-8 1 biosynthesis of unsaturated fatty acids 2 PPAR signaling pathway
count
%
P-value
Benjamini adjusted P-values
32
7.3
6.5 × 10−24
9.2 × 10−22
24
5.5
7.5 × 10−21
5.2 × 10−19
30
6.9
3.0 × 10−22
4.2 × 10−20
23
5.3
7.8 × 10−14
5.4 × 10−12
18
4.2
7.4 × 10−19
9.8 × 10−17
25
5.9
3.3 × 10−16
2.2 × 10−12
a
KEGG pathways are identified using top-500 genes through the Database for Annotation, Visualization and Integrated Discovery (DAVID).
“marker” topic for compounds in the first category (i.e., PPARα agonists), was found to be relevant to PPAR signaling pathway (p = 6.5 × 10−24) and fatty acid metabolism (p = 7.5 × 10−21). Topic-1, the “marker” topic for compounds in the second category (i.e., ER ligands), was found to be related to compliment and coagulation cascades (p = 3.0 × 10−22) and drug metabolism (p = 7.8 × 10−14). Topic-8, the “marker” topic for compounds in the third category (i.e., HMG-CoA reductase inhibitors), was found to be related to biosynthesis of unsaturated fatty acids (p = 7.4−19) and PPAR signaling pathway (p = 3.3 × 10−16). The KEGG pathways detected by 500 randomly sampled genes (repeated 100 times) from the RNA-Seq gene list are shown in Supporting Information, Table S3. A comparison confirmed that the KEGG pathways identified by the first-500 highly associated genes in each of the “marker” topics could not be detected by random sampling of 500 genes from the RNA-Seq gene list. D
dx.doi.org/10.1021/tx500148n | Chem. Res. Toxicol. XXXX, XXX, XXX−XXX
Chemical Research in Toxicology
Article
Table 2. Compounds in TGP Microarray Data Set Queried by the “Marker” Topic Associated with PPAR Signalling Pathwaya compd name WY-14643
fenofibrate
gemfibrozil
clofibrate
duration time (day)
dose (mg/kg)
querying ratio
29 29 29 8 8 15 15 15 8 4 4 4
100 30 10 100 30 30 10 100 10 100 30 10
12/12
15 4 8 8 4 29 29 15 8 29 15 4
100 100 100 1000 1000 1000 100 1000 10 10 10 10
12/12
15 29 4 15 8 8 29 15 8 4 29 4
300 300 300 100 300 100 100 30 30 100 30 30
12/12
8 15 29 4 15 29 8 4 15 8
300 300 300 300 100 100 100 100 30 30
10/12
8 29 15 29
300 300 300 100
8/12
compd name
duration time (day)
dose (mg/kg)
4 8 4 15
300 100 100 100
querying ratio
benzbromarone
8 4 15 29 15 4 29 8
200 200 200 200 60 60 60 60
8/12
simvastatin
8 29 15 4 15 4
400 400 400 400 120 120
6/12
aspirin
4 29 8 15 8
450 450 450 450 150
5/12
amiodarone
29 15 8 4
200 200 200 200
4/12
terbinafine
29 4 15 8
750 750 750 750
4/12
acetaminophen
29 4 4 29
600 1000 600 1000
4/12
bendazac
29 8 4
300 300 300
3/12
chloramphenicol
15 29 29
1000 1000 300
3/12
a
benziodarone
The queried samples were ranked by the cosine similarity. Querying ratio: the number of queried samples/total number of samples generated by each compound.
■
DISCUSSION In this proof-of-concept study, we applied topic modeling to discovering the hidden functional modules in RNA-Seq based TGx data. We observed that samples treated by the compounds with the same MoAs were clustered based on the similarity
PPAR signaling pathway, and the remaining four (22.2%) (i.e., carbon tetrachloride, norethindrone acetate, fluconazole, and azathioprine) could not be confirmed (Figure 3b). Therefore, the total querying accuracy was about 85% (i.e., 92.3% from TGP data set and 77.8% from DrugMatrix data set). E
dx.doi.org/10.1021/tx500148n | Chem. Res. Toxicol. XXXX, XXX, XXX−XXX
Chemical Research in Toxicology
Article
Table 3. Compounds in DrugMatrix Microarray Data Set Queried by the “Marker” Topic Associated with PPAR Signaling Pathwaya compd name fenofibrate
gemfibrozil
bezafibrate
lovastatin
carbon tetrachloride
methyl salicylate
valproic acid
duration (day)
dose (mg/kg)
querying ratio
3 3 5 5 1 1 5 7 1 0.25 0.25 0.25
100 215 215 43 215 43 43 100 100 100 215 43
12/12
3 0.25 3 7 1 7 1 0.25
100 700 700 100 700 700 100 100
8/8
3 7 1 7 0.25 3 0.25 1
617 617 617 100 617 100 100 100
8/8
3 5 5 1 1 0.25
1500 1500 450 450 1500 450
6/6
0.25 3 7 0.25 7
1175 1175 1175 400 400
5/8
5 3 1 0.25
444 444 444 444
4/4
1500
3/3
1
compd name
duration (day)
dose (mg/kg)
querying ratio
5 3
1340 1340
pirinixic acid
1 5 3
364 364 364
3/3
pioglitazone
3 3 5
1500 300 1500
3/3
norethindrone acetate
5 3 1
125 125 125
3/3
nafenopin
1 5 3
338 338 338
3/3
ibuprofen
5 3 1
263 263 263
3/3
fluconazole
1 5 3
394 394 394
3/5
dexamethasone
5 3 3
1 150 1
3/7
clofibric acid
1 3 5
448 448 448
3/3
clofibrate
7 3 7
500 130 130
3/3
azathioprine
5 3 3
54 160 20
3/4
aspirin
5 3 0.25
375 375 375
3/8
a
The queried samples were ranked by the cosine similarity. Querying ratio: The number of queried samples/total number of samples generated by each compound.
measured from their topic distributions. The “marker” topic most relevant to the compounds in each cluster was identified. These topics were interpreted by gene enrichment analysis through using their associated genes and then confirmed by the compound and pathway associations in the literature. For further investigating the utility of the identified “marker” topics, we tested the topic transferability from RNA-Seq to microarrays.
As shown in Figure 2, samples treated with three different classes of compounds were well clustered. Additionally, the “marker” topic for compounds in each cluster was highly relevant to the MoAs of those compounds. For example, topic-3, which was identified to be associated with PPAR signaling pathway and fatty acid metabolism, was the “marker” topic for the compounds in the first category (i.e., PPARα agonists). The “marker” topic for the statin drugs in the third category was relevant to F
dx.doi.org/10.1021/tx500148n | Chem. Res. Toxicol. XXXX, XXX, XXX−XXX
Chemical Research in Toxicology
Article
Figure 4. Confirmation of the queried results by using a “marker” topic associated with PPAR signaling pathway to identify the relevant samples from two different microarray data sets. (a) TGP microarray data set; (b) DrugMatrix microarray data set.
biosynthesis of unsaturated fatty acids and PPAR signaling pathway. Therefore, we concluded that statins were not only involved in the inhibition of HMG-CoA reductase but also in promoting the biosynthesis of unsaturated fatty acids and activating PPARs.23−25 We also found that the “marker” topic for activators of ERs in the second category was relevant to drug metabolism (p = 7.8 × 10−14). Although ERs are widely expressed in different tissue types, liver is not the organ with the highest expression of ERs.26 Thus, liver might not be the target organ of those compounds in the second category. However, liver is the primary organ for metabolism of xenobiotics and contains massive of metabolizing enzymes. It has been reported that ERs can regulate the expression of cytochrome P450 (CYP) enzymes such as CYP2C19,27 2C9,28 2A6,29 and 1B1.30 Complement and coagulation cascades identified by this topic can be interpreted by the association of estrogen with coagulation,31−33 and liver is the main organ responsible for producing coagulation factors. Various serine protease belonging to the coagulation system are able to activate the complement cascade.34 Therefore, all “marker” topics discovered from the RNA-Seq data could be well interpreted, as they were highly relevant to the MoAs for compounds used to generate samples in each cluster. Because of the probability frame, results from topic modeling provide different perspectives and a synergy with other methodologies to achieve an enhanced interpretation of biological systems. Since each word is assigned to multiple topics in a topic model, each gene can involve different functional modules, which should represent better the complicated biological relationships. We observed the overlapping genes among the identified “marker” topics. In addition, the fact that one compound may be involved in multiple MoAs could also be identified by topic modeling. For example, topic-3 was highly related not only to PPAR agonists but also to statins. Samples generated from the treatment of simvastatin and lovastatin in microarray data sets were queried by topic-3 (Tables 2 and 3), which also confirmed that statin drugs were associated with PPAR signaling pathway. Discovering “marker” topic from RNA-Seq data would be helpful for leveraging the legacy microarray data. Samples from 12 treatments of WY-14643, fenofibrate, and gemfibrozil and samples from 10 treatments of clofibrate in the TGP microarray data set were identified by topic-3. However, the results given in Supporting Information, Table S4 shows that the PPAR signaling pathway can not be identified by the microarray DEGs of samples
treated by fenofibrate at low dose (10 mg/kg) or short duration time (4 days), or samples treated by gemfibrate at short duration time (4 days) with different dose levels (30, 100, or 300 mg/kg). Similarly, DEGs detected by microarrays failed to show association with PPAR signaling pathway for the samples treated by gemfibrate at low level (30 mg/kg) for middle duration time (8 days), or by clofibrate at the low dose level (30 mg/kg) for all different duration times (4, 8, 15, or 29 days). Thus, the biological “signals” in microarray data would be amplified by transferring the “marker” topics from RNA-Seq to microarrays. Of note, we filtered the RNA-Seq expression data prior to topic modeling. Given the SEQC study results,9 however, gene expression measured by RNA-Seq and microarray platforms was highly overlapped, thus the similar filtering process was not carried out for microarray data. To the best of our knowledge, it has not yet been reported how many words are sufficient to interpret a topic in topic model. For the gene enrichment analysis, usually hundreds to thousands of microarray DEGs will be implemented to detect the gene pathways. In this study, the top-500 genes associated with those “marker” topics were used to interpret the biological functions of these topics by using the gene enrichment analysis. It was expected that other genes with strong correlations with the first500 genes might have the potential to integrate the identified modules. We calculated the correlation coefficient between all the genes to examine whether the genes strongly correlated to the 500 top ranked could be used to interrogate a specific mechanism. For example, in topic-3 which was related to PPAR mechanism, we observed that 995 genes highly correlated (≥0.8) with the 500 top ranked genes. However, only 14 of these 995 genes were known to be associated with PPAR signaling pathway. The result indicates that the 500 top ranked genes of a topic represent well about the mechanism the topic tends to infer. To confirm the specificity of the observed gene pathways, we compared the results with 500 genes randomly sampled from RNA-Seq data for 100 times, and no such KEGG gene pathways could be detected with the similar significance as that detected by the “marker” topics. Therefore, the KEGG pathways for the “marker” topics were unlikely detected by chance. In this study, principal component analysis (PCA) and hierarchical clustering analysis (HCA) were also applied to comparing with topic modeling and all three methods had the comparable performance in grouping samples by MoAs (Supporting Information, Figure S1). In a topic model, the same gene can be presented in different topics (clusters) with a G
dx.doi.org/10.1021/tx500148n | Chem. Res. Toxicol. XXXX, XXX, XXX−XXX
Chemical Research in Toxicology
Article
ization, and Integrated Discovery; KEGG, Kyoto Encyclopedia of Genes and Genomes; ER, estrogen receptor; HMG-CoA, 3hydroxy-3-methyl-glutaryl-CoA; ERE, estrogen response element; CYP, cytochrome P450
probability frame, which provides additional benefits in displaying the multiple roles of a gene. For example, Scd1 gene was ranked on the top in topic-3 (implying PPAR signaling pathway). Scd1 was also ranked as the 25th gene in topic-8 (implying biosynthesis of unsaturated fatty acids). On the contrary, this gene was absent in topic-1 (implying complement and coagulation cascades and drug metabolism). The results are consistent with our understanding that the enzyme stearoyl-CoA desaturase-1 (SCD1) involves in both the lipid metabolism and lipogenic pathways.35 Consequently, this demands an analytical method that has the ability to recognize multiple roles of the same gene. Thus, topic modeling is more than just a clustering method, which provides additional interpretation, thus compliments to other unsupervised methods such as PCA and HCA.
■
■
CONCLUSIONS In this study, topic modeling was successfully applied to discovering functional modules in RNA-Seq based TGx data. The topics indicating functional modules associated with the corresponding MoAs were identified, interpreted and confirmed. This proof-of-concept study demonstrates the applicability of topic modeling for leveraging information within the TGx data in RNA-Seq era.
■
ASSOCIATED CONTENT
S Supporting Information *
Results from PCA and HCA; sample information; topic distributions of samples; KEGG pathways identified by gene enrichment analysis with 500 randomly sampled genes; KEGG pathways for 4 PPARα agonists in the TGP data set identified by DEGs. Expression normalization (.DOC). Sample name description, processed RNA sequence data, processed RNA sequence data log 2 (.TXT in .ZIP file). This material is available free of charge via the Internet at http://pubs.acs.org.
■
REFERENCES
(1) Nuwaysir, E. F., Bittner, M., Trent, J., Barrett, J. C., and Afshari, C. A. (1999) Microarrays and toxicology: the advent of toxicogenomics. Mol. Carcinog. 24, 153−159. (2) Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5, 621−628. (3) Wang, Z., Gerstein, M., and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Rev. Genet. 10, 57−63. (4) Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M., and Snyder, M. (2008) The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344−1349. (5) Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y. (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509−1517. (6) Anders, S., and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biol. 11, R106. (7) Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003) Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993−1022. (8) Blei, D., Carin, L., and Dunson, D. (2010) Probabilistic Topic Models. IEEE Signal Process. Mag. 27, 55−65. (9) Wang, C., Gong, B., Bushel, P. R., Thierry-Mieg, J., Thierry-Mieg, D., Xu, J., Fang, H., Hong, H., Shen, J., Su, Z., Meehan, J., Guo, C., Li, X., Yang, L. (2014) RNA-seq and microarray gene expression vie for superiority within a comprehensive study design. Nature Biotechnol. In press. (10) Uehara, T., Ono, A., Maruyama, T., Kato, I., Yamada, H., Ohno, Y., and Urushidani, T. (2010) The Japanese toxicogenomics project: application of toxicogenomics. Mol. Nutr. Food Res. 54, 218−227. (11) Carlson, M. rat2302. db: Affymetrix Rat Genome 230 2.0 Array Annotation Data (chip rat2302), R package version 2.8.1. (12) McCallum, A. K. (2002) MALLET: a machine learning for language toolkit, http://mallet.cs.umass.edu. (13) Griffiths, T. L., and Steyvers, M. (2004) Finding scientific topics. Proc. Natl. Acad. Sci. U. S. A. 101 (Suppl 1), 5228−5235. (14) Asuncion, A., Welling, M., Smyth, P., and Teh, Y. W. (2009) On smoothing and inference for topic models, In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence pp 27−34, AUAI Press. (15) Huang, D. W., Sherman, B. T., and Lempicki, R. A. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protoc. 4, 44−57. (16) Khandoudi, N., Delerive, P., Berrebi-Bertrand, I., Buckingham, R. E., Staels, B., and Bril, A. (2002) Rosiglitazone, a peroxisome proliferator-activated receptor-gamma, inhibits the Jun NH(2)-terminal kinase/activating protein 1 pathway and protects the heart from ischemia/reperfusion injury. Diabetes 51, 1507−1514. (17) Istvan, E. S., and Deisenhofer, J. (2001) Structural mechanism for statin inhibition of HMG-CoA reductase. Science 292, 1160−1164. (18) Cleuren, A. C., Van der Linden, I. K., De Visser, Y. P., Wagenaar, G. T., Reitsma, P. H., and Van Vlijmen, B. J. (2010) 17alphaEthinylestradiol rapidly alters transcript levels of murine coagulation genes via estrogen receptor alpha. J. Thromb. Haemost. 8, 1838−1846. (19) Bourdeau, V., Deschenes, J., Metivier, R., Nagai, Y., Nguyen, D., Bretschneider, N., Gannon, F., White, J. H., and Mader, S. (2004) Genome-wide identification of high-affinity estrogen response elements in human and mouse. Mol. Endocrinol. 18, 1411−1427. (20) Faulds, M. H., Zhao, C., Dahlman-Wright, K., and Gustafsson, J. A. (2012) The diversity of sex steroid action: regulation of metabolism by estrogen signaling. J. Endocrinol. 212, 3−12. (21) Rise, P., Pazzucconi, F., Sirtori, C. R., and Galli, C. (2001) Statins enhance arachidonic acid synthesis in hypercholesterolemic patients. Nutr. Metab. Cardiovasc. Dis. 11, 88−94.
AUTHOR INFORMATION
Corresponding Author
*Phone: (870) 543-7142. Fax: (870) 543-7854. E-mail: weida.
[email protected]. Funding
This study was fully financed by the U.S. Food and Drug Administration (E0721511). Notes
The views presented in this article do not necessarily reflect current or future opinion or policy of the U.S. Food and Drug Administration. Any mention of commercial products is for clarification and not intended as endorsement. The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS K.Y., B.G., and M.L. are grateful to the National Center for Toxicological Research (NCTR) of the U.S. Food and Drug Administration (FDA) for postdoctoral support through the Oak Ridge Institute for Science and Education (ORISE).
■
ABBREVIATIONS TGx, toxicogenomics; LDA, latent Dirichlet allocation; MoAs, modes of actions; PPAR, peroxisome proliferator; FDA, Food and Drug Administration; SEQC, SEquencing Quality Control; NCTR, National Center for Toxicological Research; TGP, toxicogenomics project; DEG, differentially expressed gene; J-S, Jensen−Shannon; DAVID, Database for Annotation, VisualH
dx.doi.org/10.1021/tx500148n | Chem. Res. Toxicol. XXXX, XXX, XXX−XXX
Chemical Research in Toxicology
Article
(22) Chinetti, G., Lestavel, S., Bocher, V., Remaley, A. T., Neve, B., Torra, I. P., Teissier, E., Minnich, A., Jaye, M., Duverger, N., Brewer, H. B., Fruchart, J. C., Clavey, V., and Staels, B. (2001) PPAR-alpha and PPAR-gamma activators induce cholesterol removal from human macrophage foam cells through stimulation of the ABCA1 pathway. Nature Med. 7, 53−58. (23) Grip, O., Janciauskiene, S., and Lindgren, S. (2002) Atorvastatin activates PPAR-gamma and attenuates the inflammatory response in human monocytes. Inflamm. Res. 51, 58−62. (24) Zelvyte, I., Dominaitiene, R., Crisby, M., and Janciauskiene, S. (2002) Modulation of inflammatory mediators and PPARgamma and NFkappaB expression by pravastatin in response to lipoproteins in human monocytes in vitro. Pharmacol. Res. 45, 147−154. (25) Yano, M., Matsumura, T., Senokuchi, T., Ishii, N., Murata, Y., Taketa, K., Motoshima, H., Taguchi, T., Sonoda, K., Kukidome, D., Takuwa, Y., Kawada, T., Brownlee, M., Nishikawa, T., and Araki, E. (2007) Statins activate peroxisome proliferator-activated receptor gamma through extracellular signal-regulated kinase 1/2 and p38 mitogen-activated protein kinase-dependent cyclooxygenase-2 expression in macrophages. Circ. Res. 100, 1442−1451. (26) Pfaffl, M. W., Lange, I. G., Daxenberger, A., and Meyer, H. H. (2001) Tissue-specific expression pattern of estrogen receptors (ER): quantification of ER alpha and ER beta mRNA with real-time RT-PCR. Acta Pathol. Microbiol. Immunol. 109, 345−355. (27) Mwinyi, J., Cavaco, I., Pedersen, R. S., Persson, A., Burkhardt, S., Mkrtchian, S., and Ingelman-Sundberg, M. (2010) Regulation of CYP2C19 expression by estrogen receptor alpha: implications for estrogen-dependent inhibition of drug metabolism. Mol. Pharmacol. 78, 886−894. (28) Mwinyi, J., Cavaco, I., Yurdakok, B., Mkrtchian, S., and IngelmanSundberg, M. (2011) The ligands of estrogen receptor alpha regulate cytochrome P4502C9 (CYP2C9) expression. J. Pharmacol. Exp. Ther. 338, 302−309. (29) Higashi, E., Fukami, T., Itoh, M., Kyo, S., Inoue, M., Yokoi, T., and Nakajima, M. (2007) Human CYP2A6 is induced by estrogen via estrogen receptor. Drug Metab. Dispos. 35, 1935−1941. (30) Tsuchiya, Y., Nakajima, M., Kyo, S., Kanaya, T., Inoue, M., and Yokoi, T. (2004) Human CYP1B1 is regulated by estradiol via estrogen receptor. Cancer Res. 64, 3119−3125. (31) Gopal, S., Garibaldi, S., Goglia, L., Polak, K., Palla, G., Spina, S., Genazzani, A. R., Genazzani, A. D., and Simoncini, T. (2012) Estrogen regulates endothelial migration via plasminogen activator inhibitor (PAI-1). Mol. Hum. Reprod. 18, 410−416. (32) Lacroix, K. A., Bean, C., Reilly, R., and Curran-Celentano, J. (1997) The effects of hormone replacement therapy on antithrombin III and protein C levels in menopausal women. Clin. Lab. Sci. 10, 145−148. (33) Heistinger, M., Stockenhuber, F., Schneider, B., Pabinger, I., Brenner, B., Wagner, B., Balcke, P., Lechner, K., and Kyrle, P. A. (1990) Effect of conjugated estrogens on platelet function and prostacyclin generation in CRF. Kidney Int. 38, 1181−1186. (34) Amara, U., Rittirsch, D., Flierl, M., Bruckner, U., Klos, A., Gebhard, F., Lambris, J. D., and Huber-Lang, M. (2008) Interaction between the coagulation and complement system, In Current Topics in Complement II, pp 68−76, Springer, New York. (35) Sampath, H., and Ntambi, J. M. (2006) Stearoyl-coenzyme A desaturase 1, sterol regulatory element binding protein-1c and peroxisome proliferator-activated receptor-alpha: independent and interactive roles in the regulation of lipid metabolism. Curr. Opin. Clin. Nutr. Metab. Care 9, 84−88.
I
dx.doi.org/10.1021/tx500148n | Chem. Res. Toxicol. XXXX, XXX, XXX−XXX