Comparative Analysis of Predictive Models for Nongenotoxic

May 31, 2011 - Thus, the primary questions are how much better predictive performance using short-term TGx models can be achieved compared to that of ...
0 downloads 13 Views 1MB Size
ARTICLE pubs.acs.org/crt

Comparative Analysis of Predictive Models for Nongenotoxic Hepatocarcinogenicity Using Both Toxicogenomics and Quantitative StructureActivity Relationships Zhichao Liu,† Reagan Kelly,‡ Hong Fang,‡ Don Ding,‡ and Weida Tong*,† †

Center of Excellence for Bioinformatics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, Arkansas 72079, United States ‡ Z-Tech, an ICF International Company at FDA's National Center for Toxicological Research, 3900 NCTR Road, Jefferson, Arkansas 72079, United States

bS Supporting Information ABSTRACT: The primary testing strategy to identify nongenotoxic carcinogens largely relies on the 2-year rodent bioassay, which is time-consuming and labor-intensive. There is an increasing effort to develop alternative approaches to prioritize the chemicals for, supplement, or even replace the cancer bioassay. In silico approaches based on quantitative structureactivity relationships (QSAR) are rapid and inexpensive and thus have been investigated for such purposes. A slightly more expensive approach based on short-term animal studies with toxicogenomics (TGx) represents another attractive option for this application. Thus, the primary questions are how much better predictive performance using short-term TGx models can be achieved compared to that of QSAR models, and what length of exposure is sufficient for high quality prediction based on TGx. In this study, we developed predictive models for rodent liver carcinogenicity using gene expression data generated from shortterm animal models at different time points and QSAR. The study was focused on the prediction of nongenotoxic carcinogenicity since the genotoxic chemicals can be inexpensively removed from further development using various in vitro assays individually or in combination. We identified 62 chemicals whose hepatocarcinogenic potential was available from the National Center for Toxicological Research liver cancer database (NCTRlcdb). The gene expression profiles of liver tissue obtained from rats treated with these chemicals at different time points (1 day, 3 days, and 5 days) are available from the Gene Expression Omnibus (GEO) database. Both TGx and QSAR models were developed on the basis of the same set of chemicals using the same modeling approach, a nearest-centroid method with a minimum redundancy and maximum relevancy-based feature selection with performance assessed using compound-based 5-fold cross-validation. We found that the TGx models outperformed QSAR in every aspect of modeling. For example, the TGx models’ predictive accuracy (0.77, 0.77, and 0.82 for the 1-day, 3-day, and 5-day models, respectively) was much higher for an independent validation set than that of a QSAR model (0.55). Permutation tests confirmed the statistical significance of the model’s prediction performance. The study concluded that a short-term 5-day TGx animal model holds the potential to predict nongenotoxic hepatocarcinogenicity.

’ INTRODUCTION The major goal of carcinogenicity testing in rodents is to identify chemicals that may pose a carcinogenic risk to humans.1 Genotoxic chemicals can be commonly identified in the early stage of development using various in vitro assays such as the Ames test,2,3 although debates exist about the accuracy and usability of some of these assays. The subsequent testing to distinguish nongenotoxic carcinogens from noncarcinogens still relies on the 2-year rodent bioassay. The current bioassay protocol used by the National Toxicology Program (NTP) involves exposing a large number of animals to varying doses of the studied chemical with histopathological assessment of multiple organs and tissues in each of the animals at the end of the 2-year exposure period. If a chemical shows indications of This article not subject to U.S. Copyright. Published 2011 by the American Chemical Society

carcinogenicity, it is then evaluated by a panel of public and private sector experts.4 This process makes the cancer bioassay a significant investment in terms of money, time, and animal use. During the past 30 years, only ∼1500 chemicals have been studied by NTP.5 However, according to reports from the U.S. Environmental Protection Agency’s (EPA’s) Toxic Substances Control Act Inventory,6 there are more than 30,000 chemicals in widespread commercial use in the United States and Canada,7 and over 14,000 substances are involved in the Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) in Europe.8 Therefore, only a small fraction of these agents have Received: February 7, 2011 Published: May 31, 2011 1062

dx.doi.org/10.1021/tx2000637 | Chem. Res. Toxicol. 2011, 24, 1062–1070

Chemical Research in Toxicology undergone carcinogenicity testing. An inexpensive and rapid approach to prioritize the chemicals for, supplement, or even replace the 2-year carcinogenicity bioassay would lead to a significant increase in the number of screened chemicals, which could improve the safety of individuals exposed to these compounds. There are several ways to prioritize chemicals for expensive assays. The production volume of a chemical and the degree of human exposure to it are important factors in the prioritization scheme. However, the toxicological activities of a chemical are the most heavily weighted factors in prioritization. While production volumes and potential human exposure levels are readily available, organ-specific toxicological profiles are lacking for the majority of environmental chemicals. In the past, in silico approaches based on quantitative structureactivity relationships (QSAR) have been useful in assessing potential toxicological effects and for predicting toxicological end points, including carcinogenicity prediction.914 The QSAR approach requires no investment in wet lab resources and allows for the efficient, rapid screening of a large number of chemicals. The fundamental premise of the QSAR approach is that chemicals with similar structures are likely to exhibit similar biological and toxicological profiles.15 This inherent limitation implies that QSAR may not fully capture the complexity of toxicological responses and therefore cannot accurately predict carcinogenicity or other complicated end points; for example, QSAR models only had an accuracy of about 60% with validation sets for carcinogenicity.912 It is worthwhile to point out that QSARs can be interpreted either as a broad practice of developing predictive models using chemical descriptors, physicochemical properties, and other types of chemical-centric representations without considering the type of statistical methods used (e.g., regression vs classification) or as a conventional definition with emphasis on the quantitative nature of this practice. In this study, we took the former definition with reference hereafter to QSARs in the context of developing classifiers using chemical descriptors. Toxicogenomics (TGx) or the application of genomic science to toxicology represents a new approach to assess, for example, which chemicals pose a carcinogenic risk.1618 This growing field of research specifically tackles the complex interaction between toxic effects elicited by exogenous stimuli (e.g., chemical compounds) and the structure and/or activity of the genome.19,20 As a basic TGx tool, microarray methodology simultaneously analyzes tens of thousands of individual gene expression levels and thus permits the assessment of characteristic modification in gene expression profiles induced by toxic compounds. TGx approaches often utilize a short-term animal study in which gene expression data are obtained from animals that are exposed to individual chemicals at multiple doses and time points.1924 Both the dosage and treatment time dependent profiles are normally needed to illustrate the underlying mechanisms of toxicity and/or to derive a predictive model. TGx models based on a short-term animal study with minimal treatment time and a single dose design have the advantage in having both a reduced time required for assessment and a lower cost of assessment in terms of animal resources and manpower than a 2-year assay. The regulatory agencies, such as the U.S. Food and Drug Administration,25 consider TGx as an emerging tool with potential utility for safety evaluation. Conceptually, a TGx approach to predicting carcinogenicity should be more effective as it involves biological data for each compound. The primary questions are, how effective TGx is compared to QSAR, and what length of chemical exposure is sufficient for the accurate prediction using the former

ARTICLE

approach? Exposure times on the order of a few days or a week would offer significant savings and considerably speed the screening process. In this study, we generated predictive signatures for nongenotoxic hepatocarcinogens from gene expression data generated with short-term animal models lasting 1 day, 3 days, and 5 days. We specifically compared the performance of the models generated from different time points to assess the relationship between exposure time and model accuracy. The predictive performance of the models constructed using gene expression was also compared with that of QSAR models which served as a metric for assessing the practical utility of the TGx models. In addition, we assessed the genes selected to be part of the TGx models by enriching them with pathway information to begin the identification of underlying mechanisms revealed by the signatures.

’ MATERIALS AND METHODS Data Set. The time course gene expression profiles used in this study are a subset of a large liver xenobiotic and pharmacological response database produced by Iconix Biosciences (part of Entelos Inc.). This data set is publicly available at the Gene Expression Omnibus (GEO) Web site (http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE8858. The data set contains 5288 microarrays associated with animals treated with 344 chemicals. Three different layouts of the GEHealthcare/Amersham Bioscience CodeLink UniSet Rat I Bioarray (EXP5280X2-584, EXP5280X2-613, and EXP5280X2-648) containing more than 10,000 probes were employed to analyze gene expression in the livers of treated rats.17 In this study, we included only treatments that were administered daily for 1 day, 3 days, and 5 days at a maximum tolerated dose with two or three biological replicates. We developed a liver cancer database called NCTRlcdb (http:// www.fda.gov/ScienceResearch/BioinformaticsTools/ucm236173.htm)14 based on the Carcinogenicity Potency Database (CPDB).5 The NCTRlcdb contains 999 chemicals with liver carcinogenicity annotations based on a hierarchical rule for every combination of species (rat and mouse) and sex.14 In addition, the overall score of carcinogenic potential for each chemical was also available, which was used in this study. We only selected these chemicals that have gene expression data at all three different time points (1 day, 3 days, and 5 days) from the GSE8858 data set with class labels of either liver carcinogens, carcinogens in other organs, or noncarcinogens by NCTRlcdb. For purposes of model training, liver carcinogens were defined as positives, while chemicals that (a) do not cause carcinogenicity and (b) only cause carcinogenicity in nonliver organs were defined as negatives. For the liver carcinogens, only those showing nongenotoxic carcinogenicity in rat liver were used in this study.17 A total of 62 chemicals were selected, which are summarized in Table 1; in addition, their chemical structure data are available in Supporting Information, Table 1. Gene expression was measured using three different layouts of the CodeLink UniSet Rat I Bioarray, and 10,399 common probe sets were identified. Among the 10,399 probe sets, probe sets missing more than 40% of values were excluded. After probe filtering, the other missing expression values were imputed using nearest neighbor averaging,26 yielding a subset of 10,348 probes. Biological replicates were combined with matched control samples to calculate log ratios, and then the average of the biological replicates was calculated for modeling. Quantile normalization was employed to normalize the microarray data of different time points.24 Data Analysis Method. Model Generation. Figure 1 provides a schematic overview of the modeling procedure performed in this study. • Step 1: 62 chemicals were randomly divided into training and independent validation sets with approximately one-third of the 1063

dx.doi.org/10.1021/tx2000637 |Chem. Res. Toxicol. 2011, 24, 1062–1070

Chemical Research in Toxicology

ARTICLE

Table 1. Chemicals Used in the Training and Validation Sets class

compounds

CAS number

dose (mg/kg)

classification

Training Set liver carcinogens (nongenotoxic)

nonliver carcinogens

noncarcinogens

carbon_tetrachloride

56-23-5

3178/1, 5 days; 1175/3 day

positive

dipyrone

68-89-3

1636

positive

ethanol

64-17-5

6000/1, 3 days; 3000/5 day

positive

griseofulvin

126-07-8

2500

positive

isoniazid

54-85-3

79

positive

phenobarbital

50-06-6

80

positive

pirinixic_acid

50892-23-4

364

positive

prednisolone beta-estradiol

50-24-8 50-28-2

184 150

positive negative

catechol

120-80-9

195

negative

chlorambucil

305-03-3

4.5

negative

dichlorvos

62-73-7

17

negative

ergocalciferol

50-14-6

15

negative

hydrazine

302-01-2

45

negative

ifosfamide

3778-73-2

143

negative

lead_(ii)_acetate nitrofurantoin

301-04-2 67-20-9

600 76

negative negative

phenacetin

62-44-2

619

negative

propylthiouracil

51-52-5

625

negative

1,1-dichloroethene

75-35-4

600

negative

6-mercaptopurine

50-44-2

25

negative

aspirin

50-78-2

500/1, 3 days; 375/5 day

negative

benzothiazyl_disulfide

120-78-5

2600

negative

bisphenol_a choline_chloride

80-05-7 67-48-1

610 2550

negative negative

ethylene_glycol

107-21-1

3525

negative

hexachlorophene

70-30-4

8

negative

hydrocortisone

50-23-7

56

negative

methotrexate N,N0 -diphenyl-p-phenylenediamine

59-05-2 74-31-7

27/1, 3 days; 0.3/5 day 1000

negative negative

niacin

59-67-6

2625

negative

niacinamide oxyquinoline

98-92-0 148-24-3

750 68

negative negative

phenothiazine

92-84-2

386

negative

praziquantel

55268-74-1

1200

negative

propylene_glycol

57-55-6

2000

negative

tretinoin

302-79-4

7

negative

vinblastine

865-21-4

0.3

negative

bithionol

97-18-7

333

negative

omeprazole

73590-58-6

415

negative

Validation Set liver carcinogens

nonliver carcinogens

bis (2-ethylhexyl)phthalate

117-81-7

1000

positive

chloroform

67-66-3

600

positive

nafenopin

3771-19-5

338

positive

rifampin

13292-46-1

99

positive

acetaminophen

103-90-2

972

positive

3-methylcholanthrene

56-49-5

300

negative

azathioprine capsaicin

446-86-6 404-86-4

160/1, 3 days; 54/5 days 35

negative negative

diethylstilbestrol

56-53-1

280

negative

lead(iv)_acetate

1335-32-6

600

negative

mitomycin_c

50-07-7

1.7

negative

1064

dx.doi.org/10.1021/tx2000637 |Chem. Res. Toxicol. 2011, 24, 1062–1070

Chemical Research in Toxicology

ARTICLE

Table 1. Continued class

noncarcinogens

compounds

CAS number

dose (mg/kg)

classification

procarbazine streptozotocin

671-16-9 18883-66-4

54 138

negative negative negative

cortisone

50-23-7

206

cyclosporin_a

59865-13-3

350

negative

diazepam

439-14-5

710

negative

mestranol

72-33-3

250

negative

pyrazinamide

98-96-4

1500

negative

allyl_alcohol

107-18-6

32

negative

busulfan ibuprofen

55-98-1 15687-27-1

36/1 day; 9/3, 5 days 263

negative negative

tolazamide

1156-19-0

1500

negative

was developed using the same feature selection method as that in Step 2. These 2,000 classifiers corresponding to 2,000 permuted training sets were used to predict the validation set. • Step 5: We applied the NCC model to each compound in the validation set to calculate prediction performance (more in the next section). Prediction Performance. The predictive performance of an NCC in this study is characterized using five metrics: accuracy, specificity, sensitivity, Matthew’s correlation coefficient (MCC), and the binary area under the ROC curve (AUC). These metrics are obtained by performing calculations based on the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) in the relevant data set: Accuracy ¼

TP þ TN TP þ TN þ FP þ FN

ð1Þ

Figure 1. Workflow for classifier development and validation (mRMR stands for minimum redundancymaximum relevancy).

Specificity ¼

TN TN þ FP

ð2Þ

chemicals being assigned to the independent validation set and the remainder assigned to the training set. This resulted in a positive/ negative ratio of 8/32 and 5/17 in the training and validation sets, respectively. The detailed summary of 62 chemicals is listed in Table 1. • Step 2: The process involved building and assessing candidate models using the training set. In order to compare the model performance at different time points, nearest-centroid classifier (NCC) models were built with microarray data from each time point. In the training set of each time point, a minimum redundancymaximum relevancy (mRMR)27 procedure was used to rank features (i.e., genes), and models were constructed using incremental feature selection for the 500 highest ranked features. Five hundred nested NCC models were constructed, with each model i containing all of the features from model i-1 (i > 1). Fifty iterations of 5-fold cross-validation (CV) based on compounds were performed on each of the 500 NCC models, and the single model from this set with the greatest CV accuracy was selected. • Step 3: A new NCC model was then constructed using this set of features with the full training set. • Step 4: The predictive performance of this model was measured by performing a permutation test28 and determining whether predictive performance exceeded what would be expected by chance alone. Specifically, 2,000 permuted data sets were generated for the training set, in which the chemicals’ class labels were randomly scrambled. For each permuted training set, a classifier

Sensitivity ¼

TP TP þ FN

ð3Þ

TP  TN  FP  FN MCC ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ðTP þ FPÞ  ðTP þ FNÞ  ðTN þ FPÞ  ðTN þ FNÞ ð4Þ

AUC ¼

Specificity þ Sensitivity 2

ð5Þ

Pathway Analysis. To determine the biological relevance of the signature genes selected by the model, the genes were mapped to pathways for enrichment analysis using Ingenuity Pathway Analysis (IPA) software (http://www.ingenuity.com/). QSAR Modeling. Descriptors for the QSAR study were generated using MOLD2,29 which included 777 1D and 2D descriptors. The descriptors with values of zero for more than 40% of the compounds were excluded, leaving 424 descriptors that were normalized. The QSAR modeling was carried out using the same approaches described above for the TGx models. Software. All calculations were carried out using Matlab (R2010a). The mRMR package was downloaded from http://penglab.janelia.org/ proj/mRMR/. MOLD2 software can be downloaded from http://www. fda.gov/ScienceResearch/BioinformaticsTools/Mold2/default.htm. 1065

dx.doi.org/10.1021/tx2000637 |Chem. Res. Toxicol. 2011, 24, 1062–1070

Chemical Research in Toxicology

Figure 2. Prediction accuracy in 5-fold compound-based cross-validation (CV) is plotted against the number of features selected for the TGx and QSAR models. Each point represents the average accuracy from 50 repetitions of the 5-fold CV carried out in parallel.

ARTICLE

Figure 3. Predictive performance and standard deviation for 5 different metrics (MCC, AUC, accuracy, sensitivity, and specificity) for the TGx and QSAR models for both cross-validation (CV) and independent validation (IV).

’ RESULTS Feature Selection of Significant Genes and QSAR Descriptors. Figure 2 shows the average accuracy from 50 repetitions of

the 5-fold CV for the feature selection process in the QSAR model and the 1-day, 3-day, and 5-day TGx models. The performance of all three TGx models roughly stabilized after the feature size reached 90; further increasing the number of features did not change the model performance or slightly decreased the performance. Thus, the first 90 genes (Supporting Information, Table 2) were selected as the features for each of the three time point models in order to fairly compare them. Note that the 90-gene model may not necessarily have the best performance, but a stable model is likely to improve predictive performance for the independent validation sets. For the QSAR model, the MOLD2 descriptors were selected using the same strategy. The average accuracy reaches its maximum when the number of features is equal to 15, after which the performance dramatically decreases. Therefore, the first 15 descriptors were selected as the features to develop the QSAR model (Supporting Information, Table 2). Model Performance. As depicted in Figure 1, once the final TGx and QSAR models were constructed, 2,000 repetitions of 5-fold compound-based CV were performed, and the models were also applied to an independent validation (IV) set. Five parameters (MCC, AUC, accuracy, sensitivity, and specificity) were employed to provide an objective and comprehensive evaluation of model performance. Figure 3 shows the five performance metrics in both CV and IV for the TGx and QSAR models. The CV accuracy of the TGx models increases with the number of days, and the average accuracy for all three time points is greater than 0.85. In contrast, the QSAR model has a slightly lower CV accuracy (average value = 0.76) than any of the three TGx models. A similar trend can also be observed in the other four CV performance metrics. Furthermore, the predictive performance of the models on the IV set is consistent with the CV results. For example, the 5-day model has the highest IV accuracy, and the IV accuracy of QSAR model (0.55) is lower than all of the TGx models (0.77, 0.77, and 0.82 for the 1-day, 3-day, and 5-day models, respectively). We also compared the difference in the five performance metrics between the CV and IV (Figure 4), denoted as |CV  IV|, for the TGx and QSAR models. The |CV  IV| value measures the concordance, i.e., a large |CV  IV| value indicates

Figure 4. Absolute difference, |CV  IV|, between the predictive performance for the five performance metrics in the 5-fold crossvalidation of the training set and the independent validation set for the TGx and QSAR models.

Figure 5. Comparison in the prediction accuracy of the validation set between 2,000 models derived from a permutation test (a mean distribution) and the real data set (solid dot).

either overtraining in the training model (CV > IV) or an unreliable extrapolation (IV > CV) since the internal validation should not perform significantly better than the external validation. As well as having the best overall performance in both CV and IV independently, the 5-day TGx model also has the smallest |CV  IV| value for all performance metrics (except specificity). 1066

dx.doi.org/10.1021/tx2000637 |Chem. Res. Toxicol. 2011, 24, 1062–1070

Chemical Research in Toxicology

ARTICLE

Table 2. Pathway and Functional Analysis Based on 90 Signature Genes of the 5-Day TGx Model pathways

p-value

genes

Canonical Pathways aryl hydrocarbon receptor signaling

1.18  103

TGM2, MGST1, NR0B2, ESR2

actin cytoskeleton signaling

5.78  103

FN1, PIK3R1, ARHGEF1, MSN

GR12/13 signaling

7.59  103

PIK3R1, MEF2D, ARHGEF1

estrogen receptor signaling

9.07  103

CCNC, NR0B2, ESR2

non-small cell lung cancer signaling

2.04  102

FHIT, PIK3R1

small cell lung cancer signaling

2.4  102

FHIT, PIK3R1

VDR/RXR activation

2.68  102

CCNC, HR

reelin signaling in neurons LPS/IL-1 mediated inhibition of RXR function

2.93  102 2.97  102

PIK3R1, ARHGEF1 MGST1, FABP2, NR0B2

linoleic acid metabolism

3.28  102

FADS2, Cyp2b19

TR/RXR activation

3.35  102

PIK3R1, THRB

p53 signaling

3.95  102

CCNG1, PIK3R1

cancer

6.70  1051.83  102

ADORA2A,CCNG1,ESR2, FHIT,FN1,GJA4,

gastrointestinal disease

6.70  1052.01  102

ADORA2A, ESR2, FHIT, FN1, LIPF, MGST1, NR0B2, PIK3R1, TNFRSF9, TSHR

hematological disease

6.70  1052.01  102

ACADM, ADORA2A, ESR2, FHIT, FN1,

Diseases and Disorders PIK3R1,TGM2,THRB,TNFRSF9,TSHR

LOC10431/TIMM23, NEXN, NR0B2, NUP35, PIK3R1, POLR3A, RYR3, TGM2, THRB, TNFRSF9 hepatic system disease

6.70  1051.70  102

ADORA2A, ESR2, FN1, LIPF, MGST1,

organismal injury

1.93  1042.01  102

ADORA2A, FHIT, FN1, PIK3R1, TGM2, TSHR

cell death

1.12  1052.01  102

ADORA2A, ATXN3, BIRC6, CCNG1, ESR2, EXOC2, FHIT, FN1, MEF2D, MGST1, MSN,

cell-to-cell signaling and interaction

3.36  1052.01  102

ADAM15, ADORA2A, ARHGEF1, BHLHA15,

NR0B2, PIK3R1, TNFRSF9, TSHR

Molecular and Cellular Functions

NR0B2, PIK3R1, TGM2, THRB, TNFRSF9 CCNC, CDHR5, ESR2, FN1, GJA4, MGST1, PIK3R1, RYR3, TGM2, TNFRSF9, TSHR cellular movement

5.88  1052.01  102

ADAM15, ADORA2A, ARHGEF1, CCNG1, ESR2, FHIT, FN1, IGBP1, NEXN, PIK3R1, TGM2, THRB, TNFRSF9, TSHR

amino acid metabolism

1.38  1041.35  102

ADORA2A, GPSM1, THRB, TSHR

cell cycle

4.69E-0.42.01  102

CCNG1, ESR2, FHIT, FN1, PIK3R1, THRB, TNFRSF9

Furthermore, the QSAR model had the highest |CV  IV| value for all performance metrics, indicating that the QSAR model tends to be overfitting for this data set. Results of Permutation Test. Figure 5 shows the results of the permutation tests to assess whether the models predict the validation set better than would be expected by chance alone. As with the findings described in the previous section, the 5-day TGx model performs best. Pathway Analysis. The 5-day TGx model was chosen to carry out pathway and functional analysis because of its favorable performance results. The 90 selected genes in the 5-day TGx model were carried out for pathway analysis using IPA. The significant (p < 0.05 in Fisher’s exact test) canonical pathways, disease and disorders, and molecular and cellular functions are listed in Table 2.

A number of significant pathways have been reported to be associated with carcinogenesis including aryl hydrocarbon receptor signaling,30 actin cytoskeleton signaling,31 GR12/13 signaling,32 p53 signaling,33 and estrogen receptor signaling.34 In addition, carcinogenesis-related functions including cell death, cell-to-cell signaling and interaction, cellular movement, amino acid metabolism, and cell cycle were found. Diseases associated with these genes include cancer, gastrointestinal disease, hematological disease, hepatic system disease, and organismal injury.

’ DISCUSSION In most product development processes (e.g., drugs, food additives), in vitro assays such as the AMES assay are first applied to identify genotoxic chemicals. The remaining chemicals could 1067

dx.doi.org/10.1021/tx2000637 |Chem. Res. Toxicol. 2011, 24, 1062–1070

Chemical Research in Toxicology be either nongenotoxic carcinogens or noncarcinogens, which are usually distinguished by the costly 2-year rodent bioassay. Thus, in this study, we focused on the evaluation of three TGx approaches with varying treatment times along with QSAR for distinguishing nongenotoxic hepatocarcinogens from nonliver carcinogens. This is the first attempt to place different TGx strategies and QSAR in the same context (i.e., the same sets of compounds for training and validation as well as the same classifier method used) for comparative analysis. The results clearly indicate the superiority of the TGx models over QSAR for predicting hepatocarcinogenicity. The TGx model based on a 5-day treatment using a maximum tolerated dose yielded the best results (82% prediction accuracy, 60% sensitivity, and 88% specificity) for an independent validation set (Figure 3). TGx models for nongenotoxic carcinogenicity have been studied in the literature. However, some of these models used small data sets,35 were not tested on an independent validation set,18,36 or were based on an incomplete modeling approach.37 In this study, a comprehensive modeling strategy was applied. Specifically, we applied an incremental feature selection method (mRMR) in conjunction with compound-based cross-validation to identify the features that likely lead to a better model. The resulting models were also subjected to internal validation and permutation tests, as well as being tested on an independent validation set. In addition, we also applied five different statistical measures (accuracy, specificity, sensitivity, MCC, and AUC) to provide a complete picture of the model performance. The current TGx testing strategies for carcinogenic potential varies in the literature, particularly in terms of the treatment time. For example, Elrick et al.38 found evidence for a role of metabolic and oxidative stress genes in the nongenotoxic carcinogenic effects of phenobarbital in rat liver after a 13-week treatment; Nakayama et al.37 identified distinctive gene expression deregulations in rat liver after 28 days of treatment; EllingerZiegelbauer et al.35 indicated that gene expression can reflect the carcinogenic potential of a compound after 14 days of treatment; Fielden et al. applied a TGx approach with 5-day treatment to identify nongenotoxic carcinogens;16 Irwin et al.4 developed reasonable TGx models for carcinogenic prediction using a 2-day treatment with a minimum toxic dose; and Nie et al.18 showed that gene expression biomarkers following a 1-day exposure could predict tumor formation for nongenotoxic hepatocarcinogens. Nongenotoxic carcinogenesis involves multistep stages believed to require chronic exposure. Thus, conceptually, increasing the treatment time should result in a more mechanistically relevant TGx model. The question is how long the treatment time should be so that a sound TGx model can be derived without dramatically increasing the cost for the experiment. Consensus or guidelines could be difficult to develop because the multiple parameters involved in study design affect this decision such as the choice of species and strains, the dosage applied, the sensitivity of selected microarray methodology, the minimum acceptance for model performance, etc. However, given a study design, a comparative analysis is desirable to identify a point which maximizes the potential of the TGx approach while minimizing the associated costs. Our study represents a first step toward that direction. QSAR models have had many successful applications in biological end points such as predicting estrogen receptor (ER) and androgen receptor (AR) relative binding affinities as well as chemical physical properties.3941 However, binding is a relatively straightforward biological mechanism, whereas nongenotoxic

ARTICLE

carcinogenesis is quite complicated and involves multiple factors and mechanisms. Thus, nongenotoxic carcinogenesis may not be in the application domain of QSAR models, which only considers chemical structure. QSAR has been widely used in toxicology and considered a valuable approach for prioritizing chemicals for more expensive assays.42,43 Using the same training and test compounds with the same modeling approach, the QSAR model (accuracy = 0.76 for training and 0.55 for validation) performed poorly compared to the TGx models in both cross-validation (accuracy = 0.87, 0.87, and 0.90 for 1-day, 3-day, and 5-day, respectively) and independent validation (0.77, 0.77, and 0.82). Because QSAR modeling relies solely on chemical structure, the development of a QSAR model should not be restricted with the same training and validation sets defined by the TGx approach. With a large data set, the quality of a QSAR model usually improves. For that purpose, we compared our TGx models with the QSAR models developed by Young et al. based on the entire NCTRlcdb data set.14 Their QSAR models based on a training set of 750 chemicals have ∼63% prediction accuracy for a validation set of 249 chemicals using three different modeling approaches, which is still less accurate compared to the TGx models using a rather small data set (i.e., 62 chemicals). The results further demonstrated that TGx is a more accurate approach to deal adequately with complex biological end points, such as nongenotoxic carcinogenicity. Although our study was focused on nongenotoxic liver carcinogenicity, the conclusions are likely extendable for predicting other types of toxicity with short-term animal models. This approach, with its significantly shorter time and lower expense, at the very least offers a way to prioritize chemicals for the 2-year rodent bioassay. The time and cost savings mean that TGx modeling offers a distinct advantage over the rodent bioassay alone, and expanding TGx modeling into routine use to screen chemicals has the potential to offer both savings and speed. To maximize the potential of TGx modeling, it is highly recommended that a comparative analysis between different TGx designs along with QSAR be performed as was done here.

’ ASSOCIATED CONTENT

bS

Supporting Information. Chemical structure data of the studied data set and the list of features for both QSAR and 5-day TGx models. This material is available free of charge via the Internet at http://pubs.acs.org.

’ AUTHOR INFORMATION Corresponding Author

*E-mail: [email protected]. Author Contributions

Z.L. developed the methods, performed all calculations and data analysis, and wrote the first draft of the manuscript. W.T. had the original idea and guided the data analysis and presentation of results. R.K., H.F., and D.D. contributed to the data analysis, verified the calculations, and assisted with writing the manuscript. All authors read and approved the final manuscript. Funding Sources

Z.L. is grateful to the National Center for Toxicological Research (NCTR) of U.S. Food and Drug Administration (FDA) for 1068

dx.doi.org/10.1021/tx2000637 |Chem. Res. Toxicol. 2011, 24, 1062–1070

Chemical Research in Toxicology postdoctoral support through the Oak Ridge Institute for Science and Education (ORISE).

’ DISCLOSURE The views presented in this article do not necessarily reflect those of the U.S. Food and Drug Administration. ’ REFERENCES (1) Bucher, J. R., and Portier, C. (2004) Human carcinogenic risk evaluation, Part V: The national toxicology program vision for assessing the human carcinogenic hazard of chemicals. Toxicol. Sci. 82, 363–366. (2) de Serres, F. J., and Shelby, M. D. (1979) The Salmonella mutagenicity assay: recommendations. Science 203, 563–565. (3) Lee, K. H., and Lee, B. M. (2007) Study of mutagenicities of phthalic acid and terephthalic acid using in vitro and in vivo genotoxicity tests. J. Toxicol. Environ. Health A 70, 1329–1335. (4) Auerbach, S. S., Shah, R. R., Mav, D., Smith, C. S., Walker, N. J., Vallant, M. K., Boorman, G. A., and Irwin, R. D. (2010) Predicting the hepatocarcinogenic potential of alkenylbenzene flavoring agents using toxicogenomics and machine learning. Toxicol. Appl. Pharmacol. 243, 300–314. (5) Gold, L. S., Manley, N. B., Slone, T. H., Rohrbach, L., and Garfinkel, G. B. (2005) Supplement to the Carcinogenic Potency Database (CPDB): results of animal bioassays published in the general literature through 1997 and by the National Toxicology Program in 19971998. Toxicol. Sci. 85, 747–808. (6) EPA (2004) What is the TSCA chemical substance inventory? USEPA. (7) Muir, D. C., and Howard, P. H. (2006) Are there other persistent organic pollutants? A challenge for environmental chemists. Environ. Sci. Technol. 40, 7157–7166. (8) (2008) ECHA-list of preregistered substances, REACH. (9) Benigni, R. (1991) QSAR prediction of rodent carcinogenicity for a set of chemicals currently bioassayed by the US National Toxicology Program. Mutagenesis 6, 423–425. (10) Contrera, J. F., Kruhlak, N. L., Matthews, E. J., and Benz, R. D. (2007) Comparison of MC4PC and MDL-QSAR rodent carcinogenicity predictions and the enhancement of predictive performance by combining QSAR models. Regul. Toxicol. Pharmacol. 49, 172–182. (11) Fjodorova, N., Vracko, M., Novic, M., Roncaglioni, A., and Benfenati, E. (2010) New public QSAR model for carcinogenicity. Chem. Cent. J. 4 (Suppl 1), S3. (12) Helguera, A. M., Cordeiro, M. N., Perez, M. A., Combes, R. D., and Gonzalez, M. P. (2008) QSAR modeling of the rodent carcinogenicity of nitrocompounds. Bioorg. Med. Chem. 16, 3395–3407. (13) Richard, A. M. (1998) Structure-based methods for predicting mutagenicity and carcinogenicity: are we there yet?. Mutat. Res. 400, 493–507. (14) Young, J., Tong, W., Fang, H., Xie, Q., Pearce, B., Hashemi, R., Beger, R., Cheeseman, M., Chen, J., Chang, Y. C., and Kodell, R. (2004) Building an organ-specific carcinogenic database for SAR analyses. J. Toxicol. Environ. Health A 67, 1363–1389. (15) McKinney, J. D., Richard, A., Waller, C., Newman, M. C., and Gerberick, F. (2000) The practice of structure activity relationships (SAR) in toxicology. Toxicol. Sci. 56, 8–17. (16) Fielden, M. R., Brennan, R., and Gollub, J. (2007) A gene expression biomarker provides early prediction and mechanistic assessment of hepatic tumor induction by nongenotoxic chemicals. Toxicol. Sci. 99, 90–100. (17) Natsoulis, G., Pearson, C. I., Gollub, J., B, P. E., Ferng, J., Nair, R., Idury, R., Lee, M. D., Fielden, M. R., Brennan, R. J., Roter, A. H., and Jarnagin, K. (2008) The liver pharmacological and xenobiotic gene response repertoire. Mol. Syst. Biol. 4, 175. (18) Nie, A. Y., McMillian, M., Parker, J. B., Leone, A., Bryant, S., Yieh, L., Bittner, A., Nelson, J., Carmen, A., Wan, J., and Lord P. G. (2006) Predictive toxicogenomics approaches reveal underlying

ARTICLE

molecular mechanisms of nongenotoxic carcinogenicity. Mol. Carcinog. 45, 914–933. (19) Aardema, M. J., and MacGregor, J. T. (2002) Toxicology and genetic toxicology in the new era of “toxicogenomics”: impact of “-omics” technologies. Mutat. Res. 499, 13–25. (20) Gatzidou, E. T., Zira, A. N., and Theocharis, S. E. (2007) Toxicogenomics: a pivotal piece in the puzzle of toxicological research. J. Appl. Toxicol. 27, 302–309. (21) Aubrecht, J., and Caba, E. (2005) Gene expression profile analysis: an emerging approach to investigate mechanisms of genotoxicity. Pharmacogenomics 6, 419–428. (22) Bhogal, N., Grindon, C., Combes, R., and Balls, M. (2005) Toxicity testing: creating a revolution based on new technologies. Trends Biotechnol. 23, 299–307. (23) Lettieri, T. (2006) Recent applications of DNA microarray technology to toxicology and ecotoxicology. Environ. Health Perspect. 114, 4–9. (24) Newton, R. K., Aardema, M., and Aubrecht, J. (2004) The utility of DNA microarrays for characterizing genotoxicity. Environ. Health Perspect. 112, 420–422. (25) Leighton, J. K., Brown, P., Ellis, A., Harlow, P., Harrouk, W., Pine, P. S., Robison, T., Rosario, L., and Thompson, K. (2006) Workgroup report: Review of genomics data based on experience with mock submissions: view of the CDER Pharmacology Toxicology Nonclinical Pharmacogenomics Subcommittee. Environ. Health Perspect. 114, 573–578. (26) Schwender, H. (2007) Statistical Analysis of Genotype and Gene Expression Data, Dissertation, Department of Statistics, University of Dortmund, Germany. (27) Ding, C., and Peng, H. (2005) Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 3, 185–205. (28) Radmacher, M. D., McShane, L. M., and Simon, R. (2002) A paradigm for class prediction using gene expression profiles. J. Comput. Biol. 9, 505–511. (29) Hong, H., Xie, Q., Ge, W., Qian, F., Fang, H., Shi, L., Su, Z., Perkins, R., and Tong, W. (2008) Mold(2), molecular descriptors from 2D structures for chemoinformatics and toxicoinformatics. J. Chem. Inf. Model. 48, 1337–1344. (30) Dertinger, S. D., Nazarenko, D. A., Silverstone, A. E., and Gasiewicz, T. A. (2001) Aryl hydrocarbon receptor signaling plays a significant role in mediating benzo[a]pyrene- and cigarette smoke condensate-induced cytogenetic damage in vivo. Carcinogenesis 22, 171–177. (31) Thamilselvan, V., and Basson, M. D. (2005) The role of the cytoskeleton in differentially regulating pressure-mediated effects on malignant colonocyte focal adhesion signaling and cell adhesion. Carcinogenesis 26, 1687–1697. (32) Spiegelberg, B. D., and Hamm, H. E. (2007) Roles of G-protein-coupled receptor signaling in cancer biology and gene transcription. Curr. Opin. Genet. Dev. 17, 40–44. (33) Goh, A. M., Coffill, C. R., and Lane, D. P. (2011) The role of mutant p53 in human cancer. J. Pathol. 223, 116–126. (34) Levin, E. R., and Pietras, R. J. (2008) Estrogen receptors outside the nucleus in breast cancer. Breast Cancer Res. Treat. 108, 351–361. (35) Ellinger-Ziegelbauer, H., Gmuender, H., Bandenburg, A., and Ahr, H. J. (2008) Prediction of a carcinogenic potential of rat hepatocarcinogens using toxicogenomics analysis of short-term in vivo studies. Mutat. Res. 637, 23–39. (36) Uehara, T., Hirode, M., Ono, A., Kiyosawa, N., Omura, K., Shimizu, T., Mizukawa, Y., Miyagishima, T., Nagao, T., and Urushidani, T. (2008) A toxicogenomics approach for early assessment of potential non-genotoxic hepatocarcinogenicity of chemicals in rats. Toxicology 250, 15–26. (37) Nakayama, K., Kawano, Y., Kawakami, Y., Moriwaki, N., Sekijima, M., Otsuka, M., Yakabe, Y., Miyaura, H., Saito, K., Sumida, K., and Shirai, T. (2006) Differences in gene expression profiles in the liver between carcinogenic and non-carcinogenic isomers of compounds given to rats in a 28-day repeat-dose toxicity study. Toxicol. Appl. Pharmacol. 217, 299–307. 1069

dx.doi.org/10.1021/tx2000637 |Chem. Res. Toxicol. 2011, 24, 1062–1070

Chemical Research in Toxicology

ARTICLE

(38) Elrick, M. M., Kramer, J. A., Alden, C. L., Blomme, E. A., Bunch, R. T., Cabonce, M. A., Curtiss, S. W., Kier, L. D., Kolaja, K. L., Rodi, C. P., and Morris, D. L. (2005) Differential display in rat livers treated for 13 weeks with phenobarbital implicates a role for metabolic and oxidative stress in nongenotoxic carcinogenicity. Toxicol. Pathol. 33, 118–126. (39) Fang, H., Tong, W., Shi, L. M., Blair, R., Perkins, R., Branham, W., Hass, B. S., Xie, Q., Dial, S. L., Moland, C. L., and Sheehan, D. M. (2001) Structure-activity relationships for a large diverse set of natural, synthetic, and environmental estrogens. Chem. Res. Toxicol. 14, 280–294. (40) Hong, H., Fang, H., Xie, Q., Perkins, R., Sheehan, D. M., and Tong, W. (2003) Comparative molecular field analysis (CoMFA) model using a large diverse set of natural, synthetic and environmental chemicals for binding to the androgen receptor. SAR QSAR Environ. Res. 14, 373–388. (41) Iwasa, J., Fujita, T., and Hansch, C. (1965) Substituent constants for aliphatic functions obtained from partition coefficients. J. Med. Chem. 8, 150–153. (42) Hong, H., Tong, W., Fang, H., Shi, L., Xie, Q., Wu, J., Perkins, R., Walker, J. D., Branham, W., and Sheehan, D. M. (2002) Prediction of estrogen receptor binding for 58,000 chemicals using an integrated system of a tree-based model with structural alerts. Environ. Health Perspect. 110, 29–36. (43) Shi, L., Tong, W., Fang, H., Xie, Q., Hong, H., Perkins, R., Wu, J., Tu, M., Blair, R. M., Branham, W. S., Waller, C., Walker, J., and Sheehan, D. M. (2002) An integrated “4-phase” approach for setting endocrine disruption screening priorities--phase I and II predictions of estrogen receptor binding affinity. SAR QSAR Environ. Res. 13, 69–88.

1070

dx.doi.org/10.1021/tx2000637 |Chem. Res. Toxicol. 2011, 24, 1062–1070