Comparative Evaluation of in Silico Systems for Ames Test

ARTICLE pubs.acs.org/crt

Comparative Evaluation of in Silico Systems for Ames Test Mutagenicity Prediction: Scope and Limitations Alexander Hillebrecht,* Wolfgang Muster, Alessandro Brigo, Manfred Kansy, Thomas Weiser, and Thomas Singer F. Hoffmann-La Roche Ltd., Non-Clinical Safety, Basel CH-4070, Switzerland ABSTRACT: The predictive power of four commonly used in silico tools for mutagenicity prediction (DEREK, Toxtree, MC4PC, and Leadscope MA) was evaluated in a comparative manner using a large, high-quality data set, comprising both public and proprietary data (F. Hoffmann-La Roche) from 9,681 compounds tested in the Ames assay. Satisfactory performance statistics were observed on public data (accuracy, 66.4–75.4%; sensitivity, 65.2–85.2%; specificity, 53.1–82.9%), whereas a significant deterioration of sensitivity was observed in the Roche data (accuracy, 73.1–85.5%; sensitivity, 17.4–43.4%; specificity, 77.5–93.9%). As a general tendency, expert systems showed higher sensitivity and lower specificity when compared to QSAR-based tools, which displayed the opposite behavior. Possible reasons for the performance differences between the public and Roche data, relating to the ratio of experimentally inactive to active compounds and the different coverage of chemical space, are thoroughly discussed. Examples of peculiar chemical classes enriched in false negative or false positive predictions are given, and the results of the combined use of the prediction systems are described.

’ INTRODUCTION The mutagenic potential of a chemical entity is an important safety parameter for pharmaceutical products, not only pertaining to the active pharmaceutical ingredient (API) itself but also to any accompanying substances such as metabolites, impurities, or excipients. The ability of a molecule to cause mutations like frame-shifts or base-pair substitutions is routinely determined by the Ames test.1 This in vitro assay is a key tool for mutagenicity assessment due to its long history, robustness, and well established standard protocol. The Ames test is a substantial part of the official genotoxicity testing package, which has to be performed for every new drug submission (ICH Guideline S2B; Genotoxicity: A Standard Battery for Genotoxicity Testing of Pharmaceuticals) and is routinely conducted by many laboratories all over the world. As it poses a major hurdle in early drug development, a screening version (Ames microsuspension) has been established to increase the throughput and decrease the substance need.2 The Ames test is therefore applied before phase 0 as an official GLP test as well as during lead identification and optimization as part of the screening procedure for the selection of drug candidates. Although the test needs only two days from start to result and is very cost-effective, there is a compelling need to predict its outcome on the basis of the chemical structure even earlier in drug discovery and development, before the investigated compounds are synthesized. Furthermore, the CHMP (Committee for Medicinal Products for Human Use) guideline on the limits of genotoxic impurities recommends the use of in silico tools that predict mutagenicity as a first step in hazard identification,

endorsing further evaluations on the basis of the outcome of their predictions. Therefore, several commercial and publicly available in silico tools have been developed over the past years in an effort to predict the outcome of the Ames test. In the present study, we report the results of an objective performance assessment of four programs for mutagenicity prediction, namely, DEREK for Windows (DfW), Leadscope Model Applier (LSMA), MultiCASE (MC4PC), and Toxtree.

’ DATASETS Three data sources have been used as test sets to assess the predictive power of the four programs. The first database, hereafter denoted as LSDB, was purchased from Leadscope as part of the FDA SAR Genetox Database 2008. It contains curated and quality-checked data from various public sources. For further information about the origin and composition of the data contained therein, the reader is referred to the documentation.3 Only the Salmonella findings were used for our analysis since these correspond most closely to a standard Ames test. The second source was extracted from a recent paper published by Hansen et al.4 A large fraction of this dataset was already covered by the LSDB and thus omitted. Roche’s proprietary collection of GLP and microsuspension Ames data constituted the third test set. Received: January 26, 2011 Published: May 02, 2011

dx.doi.org/10.1021/tx2000398 | Chem. Res. Toxicol. 2011, 24, 843–854


for the alerts stored in its knowledge base. These do not capture all compounds which the “neural networks” of the human scientists encoding the DEREK rules have seen, and thus, some optimistic bias can be expected for the performance figures. However, the training set compounds are not encoded by hundreds or thousands of descriptors when alerts are extracted, and therefore, a strong performance gain by mere memorization of the learning set is unlikely to occur. Since the exact training set used at Lhasa is not available to the authors, the most practical approach is to exclude from the evaluation those molecules which were designated as examples. Version 10.0.2 of DEREK for Windows was used for our evaluation. Toxtree. Toxtree is a Java-based, freely available, open-source application for toxicity prediction.9–11 It was developed by Ideaconsult Ltd. (Sofia, Bulgaria) under the terms of a contract with the European Commission Joint Research Centre. The program is mainly based on structural alerts but also provides QSAR models for distinct chemical classes to refine the predictions. For mutagenicity, Toxtree implements the so-called Benigni-Bossa rulebase12 for carcinogenicity and mutagenicity. The alerts are only differentiated into genotoxic and a small number of nongenotoxic ones, without distinction between carcinogenicity and mutagenicity. Therefore, the genotoxic alerts were used as predictors for the Ames test mutagenicity end point. Additionally, this module offers QSAR models for aromatic amines and α,β-unsaturated aldehydes, which should improve the predictivity for these specific chemical classes. However, the mutagenicity QSARs refer to Salmonella typhimurium TA100 only, and for this reason, we did not take these models into account. Regarding those structures that do not trigger any alert, the same considerations made for DEREK apply. MC4PC (MultiCASE).
MultiCASE MC4PC (Multiple Computer Aided Structure Evaluation)13–15 is a prediction suite developed by MultiCASE Inc. and belongs to the class of QSAR-based tools. It offers pretrained models for multiple end points as well as the option to derive customized models with user-specific data. MC4PC uses simplified molecular input line entry system (SMILES) codes to reduce the organic chemicals of the training data set to all possible 2–10 consecutive-atom molecular fragments. The program then compares the fragments of active and inactive molecules and identifies those fragments that are associated primarily with active molecules (biophores). The MCASE program then identifies QSAR attributes and/or molecular fragments that are modulators. Modulators are molecular (sub)structure parameters that correlate with enhanced or diminished activity of chemicals that share a common structural alert (e.g., activating fragments, inactivating fragments (biophobes), log P, and graph index). The combination of these data is used to develop a quantitative estimate of the potential toxicity of the test compounds. In order to ensure objectivity, we did not include any manual expert interpretation in the final predictions. Compounds producing an “out of domain” or “bad structure entry” flag were counted as “not predicted” (see Reported Statistics). The present study applies only to the commercially available model A2H (Ames assay) for the prediction of mutagenicity. Test compounds included in the training set of MC4PC were not considered in the statistics. Version 2.1.0.99 of MC4PC was used for this evaluation. Leadscope Model Applier (LSMA). Leadscope Inc. produces several software modules applicable in the context of toxicological forecasting. In this study, we report the results obtained with the Model Applier16 (version 1.2) module using pretrained models from the Genetic Toxicity Suite, which had been developed by Leadscope Inc. in a Collaborative Research and Development

Table 1. Number of Compounds Extracted from Each Database and the Respective Ratio of Ames Negative to Ames Positive Compounds

source    number of compounds    ratio neg/pos
LSDB      4,699                  1.27
Hansen    2,647                  0.49
Roche     2,335                  6.78
total     9,681                  1.34

All data were prepared for further processing by applying the following steps: Counterions were removed, and acidic/basic groups were neutralized if possible (i.e., quaternary nitrogens remain charged). Only molecules containing fewer than 76 nonhydrogen atoms were retained, as this threshold was appropriate to filter out proteins and large polymers. Most compounds containing covalently bound metals or other elements rarely occurring in drugs and nutritional products were removed. For each data source, a compound was considered Ames positive if it was reported to be positive in any experiment, regardless of the presence or absence of metabolic activation or of the particular Salmonella strain used. If no positive mutagenicity finding was recorded, the compound was classified as Ames negative. Table 1 gives an overview of these data sources, the number of compounds contained in each source, and the ratio of Ames negative to Ames positive data. In total, 9,681 compounds were used, constituting a uniquely large collection of high-quality data, with roughly a quarter of it being represented by proprietary pharmaceutical substances.
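The size filter and the any-experiment-positive labeling rule described above can be sketched in a few lines. This is an illustrative reconstruction only, not the authors' pipeline (which used Pipeline Pilot); the record format and the function name are our assumptions.

```python
# Illustrative sketch of the preparation rules described above (assumed
# record format; the paper's actual pipeline used Pipeline Pilot).

MAX_HEAVY_ATOMS = 75  # "fewer than 76 nonhydrogen atoms" are retained

def prepare_dataset(records):
    """Filter compounds by size and derive one Ames call per compound.

    `records` maps a compound id to a dict with:
      'heavy_atoms' -- number of non-hydrogen atoms
      'results'     -- per-experiment outcomes (True = positive finding,
                       any strain, with or without metabolic activation)
    """
    labels = {}
    for cid, rec in records.items():
        if rec["heavy_atoms"] > MAX_HEAVY_ATOMS:
            continue  # drop proteins and large polymers
        # Ames positive if positive in ANY experiment, otherwise negative.
        labels[cid] = any(rec["results"])
    return labels

data = {
    "cmpd1": {"heavy_atoms": 20, "results": [False, True, False]},
    "cmpd2": {"heavy_atoms": 18, "results": [False, False]},
    "cmpd3": {"heavy_atoms": 300, "results": [True]},  # filtered out
}
labels = prepare_dataset(data)  # {'cmpd1': True, 'cmpd2': False}
```

Counterion removal and neutralization require a chemistry toolkit and are omitted from the sketch.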

’ PREDICTION TOOLS DEREK for Windows (DfW). DEREK5–8 (Deductive Estimation of Risk from Existing Knowledge) for Windows (DfW, called DEREK throughout the rest of this article) is an expert system based on the recognition of structural alerts (SAs). The alerts are defined by Lhasa Ltd., represent knowledge from literature, academic, and industrial experts, and are regularly updated according to newly available experimental data and recent publications. The program also allows users to define proprietary, customized alerts. DEREK provides predictions for multiple species and end points that are further aggregated to superend points. In general, there is no one-to-one correspondence between a DEREK end point and a specific assay, but for mutagenicity, most alerts are derived from Ames test results. The predictions are derived from a reasoning scheme, which takes into account not only the presence or absence of a structural alert but also the species and a few calculated physicochemical parameters, resulting in the assignment of a so-called “uncertainty term”. However, in our study, we considered structural alerts only, as a comparison showed almost perfect agreement between the presence of an alert and a reasoning-based prediction at least above the “plausible” uncertainty level. Hence, we assigned a positive DEREK prediction to compounds evoking a mutagenicity alert and a negative prediction otherwise. We are well aware that no conclusion can be made for compounds that do not trigger any alert. However, for the sake of a quantitative assessment of the usefulness of these alerts, it is unavoidable to consider the absence of alerts as a negative prediction. Since DEREK is an expert system, it has no training set in a strict sense as in the case of QSAR-based systems, but there are a few examples


more widely known Cohen’s kappa,19–21 but has the advantage that the maximum/minimum achievable value is always +1/−1, regardless of the structure of the respective underlying confusion matrix. It takes into account that a distinct degree of correct predictions will occur just by chance given some marginal distributions of experimentally observed and computationally predicted activities. The exact formula is given by

Agreement (CRADA) with the U.S. FDA. Moreover, the company provides customizable software, which enables the incorporation of proprietary data. In our study, we assessed the model developed for Salmonella mutagenicity prediction. Like MC4PC, the Leadscope software employs a fragment-based QSAR paradigm; however, the fragments are not paths of distinct lengths but are predefined in a hierarchically organized dictionary that is closely related to common organic/medicinal chemistry building blocks. For binary classification problems, such as the Ames test, the algorithm identifies toxicity modulating fragments using a χ2-test. Furthermore, the software is able to build superstructures from smaller fragments if they improve predictivity. Together with eight global molecular properties, the set of fragments is then used as a descriptor set in a partial least squares (PLS) logistic regression model of the activity class. Therefore, the predictions from this algorithm are continuous probabilities of class membership rather than binary outputs. The program also assesses the applicability domain by measuring the distance to training set molecules. For the sake of our performance evaluation, we considered probabilities greater than 0.5 as an “active” prediction and probabilities smaller than 0.5 as “inactive” predictions, which is the standard procedure used by the Model Applier for pretrained models. Compounds which are annotated as “out of domain” or “missing descriptors” were counted as “not predicted”. Test molecules contained in the training set of the model were excluded from the statistics. Reported Statistics. For the calculation of performance statistics, two general approaches are conceivable: The first one is to challenge all prediction systems with the same set of molecules and record the resulting performance metrics. 
This implies that the systems are tested on different subsets of the same original set because each system has a different training set or molecules that are out of the applicability domain. The second possibility is to use the maximum common subset, for which all four prediction tools return valid predictions and which does not contain molecules from any training set. We decided to adopt the first approach since it is the most informative for a typical application scenario and uses as many molecules as available for an individual system. In order to ensure that the different test sets do not jeopardize the comparability of this evaluation, we also calculated the statistics using the second approach. As the results were very similar, we report only those obtained using the first approach in the interest of conciseness of the article. Standard statistics such as accuracy (concordance), sensitivity, and specificity17 are reported and should always be considered in connection with each other:

RIOC = [TP + TN − ((TP + FN)(TP + FP) + (FN + TN)(FP + TN))/n] / [max(TP + TN) − ((TP + FN)(TP + FP) + (FN + TN)(FP + TN))/n]

where n is the total number of compounds, and max(TP + TN) is the largest possible number of correct predictions under given fixed marginal totals. RIOC is equal to the Φ-coefficient21 for dichotomous associations normalized by its maximum achievable value. The metrics outlined above apply to a distinct confusion matrix as a result of a classification scheme returning binary predictions (like DEREK, MC4PC, and Toxtree) or a probabilistic paradigm (like LSMA) after a cutoff value (e.g., 0.5) has been chosen. However, in order to enable threshold-independent comparisons and allow for the comparison of a binary predictor with a continuous one at comparable true/false positive rates, a receiver operating characteristic (ROC)17,22 graph is used. This curve plots the true positive rate (i.e., sensitivity) on the y-axis against the false positive rate (1 − specificity) on the x-axis at all possible thresholds. Hence, for LSMA predictions a continuous line is obtained, whereas the performances of the other tools appear as a single point since no variable threshold can be chosen. From a practical point of view, not only the ability to correctly predict the outcome of an experiment but also the coverage of the employed predictor is important: For each program, we report the percentage of compounds that have not been predicted due to technical problems (rare cases) or because the application considered the molecules to be out of the applicability domain (majority of the cases). This amount is called “not predicted”. Compounds that belong to the training set of a program were not included in the performance statistics, but they were not counted as “not predicted”. All performance statistics were calculated using the ROCR package23 of the statistical programming environment R24 (version 2.9.0), using customized scripts. Various data manipulation tasks were facilitated using the programming language Python (version 2.4.2) and the data flow tool Pipeline Pilot Professional Client25 (version 7.5.2.300).
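All of the reported statistics follow directly from a single confusion matrix. The following helper is our own sketch (not from the paper's ROCR scripts); it uses the standard chance-correction from fixed marginal totals, as for Cohen's kappa.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and RIOC from a confusion matrix.

    RIOC = (correct - chance) / (max correct - chance), where the chance
    level of correct predictions follows from the fixed marginal totals.
    """
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Expected number of correct predictions by chance, given the marginals:
    # (actual pos * predicted pos + predicted neg * actual neg) / n.
    chance = ((tp + fn) * (tp + fp) + (fn + tn) * (fp + tn)) / n
    # Largest achievable TP + TN under the same fixed marginal totals.
    max_correct = min(tp + fn, tp + fp) + min(fp + tn, fn + tn)
    rioc = (tp + tn - chance) / (max_correct - chance)
    return accuracy, sensitivity, specificity, rioc

# Example: 50 actual positives, 50 actual negatives, 60 predicted positives.
acc, sens, spec, rioc = confusion_metrics(tp=40, fp=20, tn=30, fn=10)
# acc = 0.70, sens = 0.80, spec = 0.60, rioc = 0.50
```

Note that for a heavily imbalanced set (such as the Roche data) accuracy stays high even when sensitivity collapses, while RIOC drops, which is exactly why the paper reports it alongside the standard metrics.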

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

RIOC = (TotalCorrect − ChanceCorrect) / (MaximumCorrect − ChanceCorrect)

’ RESULTS AND DISCUSSION Comparison on the Dataset Level. Table 2 shows the performance metrics obtained for the prediction of the LSDB, Hansen, and Roche datasets. The four programs show a comparable overall accuracy for all three data sets (median: 77.2%), with the one for the Roche data being slightly higher. However, considering the remaining metrics, a strong dependency on the data source is evident. The sensitivity is highest for the Hansen dataset (mean over all programs 77.1%), slightly lower for the LSDB data (71.0%), and significantly lower for the Roche data (33.6%). One important factor leading to this significant discrepancy is the very low proportion of Ames positives within a

where TP (TN) are the true positives (negatives), and FP (FN) are the false positives (negatives). Especially for strongly skewed distributions, as is the case for our in-house data, where the Ames negatives heavily outweigh the positives, accuracy alone is not an informative measure of performance. Therefore, we always report the ratio of experimental negatives to positives and a metric called RIOC (relative improvement over chance).18 RIOC is closely related to the


proprietary pharmaceutical database such as the Roche data (negative/positive ratio is about 7) compared to collections of publicly available data (negative/positive ratio is about 0.5–1.3), decreasing the probability of correctly detecting the Ames positives just by pure chance. The RIOC metric, however, should take this effect at least partially into account. This value clearly deteriorates as well (mean value for LSDB, 54.4%; Hansen, 45.6%; Roche, 24.0%), suggesting not only that the Roche data set is depleted of Ames positive compounds but also that these molecules have different properties and are more difficult to detect compared to the actives in the public data. A structural analysis of the chemotypes (see Fundamental Discrepancies in the Representation of Chemotypes) confirms this hypothesis. The specificity values show the typical expected trade-off behavior: as the Roche data had the lowest sensitivity, it takes on the highest specificity values (mean: 87.2%), whereas for the public datasets, specificity is somewhat lower (Hansen, 62.5%; LSDB, 77.2%) but still acceptable. Comparison on the Program Level. One remarkable observation, in terms of individual program performance, is that the performance metrics show similar characteristics for programs belonging to the same class of prediction paradigm (QSAR versus expert systems). The accuracy is comparable for the four programs predicting the same dataset. However, the SA-based programs (DEREK, Toxtree) tend to exhibit a higher sensitivity at the cost of a lower specificity, whereas the QSAR tools (LSMA and MC4PC) show the opposite behavior. This is not unexpected since SAs are more conservative by nature and do not take into account the effects of the conjoint occurrence of fragments within the same molecule, as QSAR tools do. The values reported in Table 2 for LSMA refer to the confusion matrices obtained when a probability threshold of 0.5 is used as the cutoff.
This documents the situation in which the software output is taken as is without any user adjustments, and this setup is the main focus of this study. However, the ROC curve allows one to establish the sensitivity of the LSMA model if the threshold is adjusted to yield the same specificity as one of the

binary predictors. In the same way, it is possible to ascertain the specificity of the LSMA model at a cutoff value chosen to yield the same sensitivity as a given binary predictor. Table 3 lists the thresholds that have to be chosen for the LSMA-predicted probability to yield the same sensitivities (specificities) and the resulting

Table 3. Performance of the Leadscope MA Model at Adjusted Thresholds Chosen to Yield the Target Value of the Binary Classifiers Given in the First Column on the Left

target value [%]a       threshold   adj. sensitivity [%]   adj. specificity [%]

LSDB
DEREK sens 71.7         0.455       71.7                   75.3
DEREK spec 78.1         0.510       68.5                   78.2
Toxtree sens 78.0       0.332       78.0                   66.7
Toxtree spec 70.0       0.358       75.6                   70.1
MC4PC sens 65.2         0.561       65.2                   80.1
MC4PC spec 82.9         0.652       60.0                   82.9

Hansen
DEREK sens 80.9         0.294       80.8                   47.6
DEREK spec 59.1         0.423       72.5                   59.1
Toxtree sens 85.2       0.236       85.2                   40.1
Toxtree spec 53.1       0.344       76.7                   53.1
MC4PC sens 74.6         0.374       74.6                   56.4
MC4PC spec 74.0         0.706       55.9                   74.0

Roche
DEREK sens 43.4         0.251       43.0                   76.5
DEREK spec 91.6         0.419       20.4                   91.6
Toxtree sens 42.9       0.251       43.0                   76.5
Toxtree spec 77.5       0.260       40.9                   77.6
MC4PC sens 30.6         0.335       30.4                   86.5
MC4PC spec 85.8         0.329       30.4                   85.7

a This column displays the program and its performance value as taken from Table 2 to which the LSMA model is to be compared.
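The threshold adjustment underlying Table 3 amounts to walking the discrete ROC points of the continuous classifier and picking the cutoff whose specificity (or, symmetrically, sensitivity) best matches the target value of a binary predictor. A minimal pure-Python sketch (the function and its name are ours, not from the paper or the ROCR package):

```python
def adjust_threshold_for_specificity(scores, labels, target_spec):
    """Find the score cutoff whose specificity best matches `target_spec`
    and report the sensitivity obtained at that cutoff (cf. Table 3).

    Candidate cutoffs are the observed scores themselves, mirroring the
    discrete ROC points; scores >= cutoff are predicted positive.
    Returns (threshold, achieved specificity, sensitivity).
    """
    best = None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y)
        fn = sum(1 for s, y in zip(scores, labels) if s < cut and y)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and not y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and not y)
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        # Keep the cutoff whose specificity is closest to the target.
        if best is None or abs(spec - target_spec) < abs(best[1] - target_spec):
            best = (cut, spec, sens)
    return best

# Toy example: a well-separated classifier reaches the target exactly.
thr, spec, sens = adjust_threshold_for_specificity(
    [0.9, 0.8, 0.4, 0.3, 0.2, 0.1],
    [True, True, True, False, False, False],
    target_spec=1.0,
)
```

Because only discrete ROC points exist, the achieved value generally differs slightly from the target, which is why the paper notes a residual difference below 0.5% between adjusted and target values.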

Table 2. Performance Statistics Obtained for the Prediction of the Three Datasetsa

           accuracy [%]   sensitivity [%]   specificity [%]   RIOC [%]   number of compoundsb   not processed [%]c   ratio neg/pos

LSDB
DEREK      75.4           71.7              78.1              49.9       4,633                  0.15                 1.30
LSMA       74.0           69.2              77.8              48.3       2,306                  13.1                 1.27
Toxtree    73.5           78.0              70.0              55.0       4,698                  0.04                 1.27
MC4PC      71.5           65.2              82.9              64.3       2,018                  2.23                 0.56

Hansen
DEREK      73.7           80.9              59.1              41.0       2,630                  0.00                 0.50
LSMA       66.4           67.8              63.8              36.2       1,921                  17.40                0.53
Toxtree    74.6           85.2              53.1              46.1       2,647                  0.00                 0.49
MC4PC      74.4           74.6              74.0              59.0       1,099                  4.18                 0.30

Roche
DEREK      85.5           43.4              91.6              35.0       2,327                  0.34                 6.89
LSMA       83.6           17.4              93.9              20.2       1,700                  26.80                6.39
Toxtree    73.1           42.9              77.5              23.7       2,335                  0.00                 6.78
MC4PC      78.9           30.6              85.8              17.2       2,284                  1.59                 7.04

a The best results are displayed in bold font. b Total number of compounds that were used in the statistics (without training set molecules or those counted as “not predicted”). c Percentage of input molecules not processed for technical reasons or considered out of domain. For DEREK and Toxtree, this percentage does not correspond to coverage.


too strongly toward high specificity. DEREK turns out to be clearly superior to the other three packages on the Roche data: It exhibits the highest sensitivity (43.4%, almost the same as for Toxtree) and a very good specificity (91.6%), which is significantly higher than that of Toxtree (77.5%). DEREK’s performance mark (cross) is located significantly above the LSMA curve in the ROC plot. To yield DEREK’s sensitivity, the specificity of the LSMA model would have to go down to 76.5%, being almost equivalent to the characteristics of the Toxtree model. Table 2 also reports the percentage of compounds for which no prediction could be obtained (compound either not processed or out of the applicability domain). For the QSAR-based systems LSMA and MC4PC, this figure describes the coverage of the predictor. An unambiguous trend is observed which is conserved over all three data sets: A small to moderate number of compounds was rejected by MC4PC (1.59–4.18%), mainly because they were considered by the software to be out of domain. Because Toxtree uses less refined structural alerts and displays similar sensitivity but lower specificity than DEREK, the program is not included in subsequent analyses in the interest of concision of this article. The LSMA module exhibits the highest number of unpredicted compounds (13.1–26.8%), with values around 15% for both public datasets and a particularly high fraction (26.8%) for the Roche data, mostly because the software considers the molecules to be out of the applicability domain. Fundamental Discrepancies in the Representation of Chemotypes. Assuming that DEREK’s structural alerts capture commonly known chemotypes associated with mutagenicity, a comparison of the relative frequencies of alerts triggered by each dataset (normalized to the respective number of compounds) can be used to assess fundamental differences in the datasets and show the amount of “obvious” mutagens contained within each data source.
The corresponding bar plot is shown in Figure 2. It is apparent that the public datasets trigger many more alerts compared to the Roche in-house data. Aromatic nitro compounds constitute the alert with the highest abundance in the LSDB and Hansen datasets (both about 12%), whereas alkylating agents (mostly alkyl halides) are the alerts with the highest frequency in the Roche data (however, only 1.8% of the dataset). Polycyclic aromatic compounds are among the top three alerts fired within the public datasets (7.2% and 3.1% for Hansen and LSDB, respectively), while this toxophore does not occur in any Roche compound. N-Nitroso compounds are the SAs with the fourth highest frequency (6.3%) in the Hansen data but occur rarely in the LSDB or Roche data (1.0% and 0.3%). The allylamine alert is a substructure that is largely unique to the Roche dataset (1.1% vs 0.06% and 0.07% in LSDB and Hansen, respectively). In summary, the public datasets are richer in commonly known and obvious mutagenic structures, in contrast to the proprietary Roche data, which contain fewer of these moieties. Polycyclic aromatic and nitro compounds are especially heavily represented in the public datasets, thus explaining the significantly higher sensitivity of DEREK and Toxtree for these datasets. Analysis of False Negatives. The most striking prediction problem encountered in this study is the relatively low sensitivity. This is particularly true for the Roche dataset but also for the LSDB dataset, as significant amounts of Ames positive compounds are not recognized by any of the programs investigated. Therefore, we tried to identify structural motifs that are associated with higher false negative rates. To facilitate this undertaking, the whole pooled dataset was subjected to a

Figure 1. Receiver operating characteristic (ROC) plot depicting the performances of the four investigated programs on three data sources. Binary classifiers (DEREK, Toxtree, and MC4PC) appear as single points, whereas classifiers generating continuous scores (LSMA) yield full curves. The dataset is identified by color and the classifier by the shape of the objects. The solid orange line corresponds to a perfect classifier and the dashed gray diagonal to a random predictor.

specificities (sensitivities) as those obtained with the other software packages. Because of the discrete nature of ROC points, it is usually not possible to yield exactly the same values, but the difference between adjusted and target values is always below 0.5%. The red symbols in Figure 1 (LSDB data) are located almost on top of the line of the LSMA performance, indicating that they all perform comparably to this program if the threshold is adjusted accordingly. In addition, the values in Table 2 indicate that the packages do not differ dramatically, with Toxtree showing the highest sensitivity (78.0%) but the lowest specificity (70.0%). Considering the Hansen data (green symbols), MC4PC, DEREK, and Toxtree perform better than the LSMA model since all symbols are located clearly above the green line. For this data set, MC4PC provides the best trade-off (sensitivity, 74.6%; specificity, 74.0%) and significantly outperforms the LSMA model when used with the 0.5 threshold (sensitivity, 67.8%; specificity, 63.8%; see Table 2) or when adjusting the threshold accordingly (adjusted sensitivity, 55.9%; adjusted specificity, 56.4%; see Table 3). The SA-tools (DEREK and Toxtree) have the highest sensitivity, however, with a significant drop in specificity (59.1% and 53.1%, respectively). All four investigated programs suffer a substantial deterioration in performance when applied to the Roche in-house dataset. The ROC graph (blue items in Figure 1) shows that Toxtree and MC4PC results constitute two comparable trade-offs of sensitivity and specificity that can be obtained by choosing respective thresholds for the LSMA model. In particular, it puts the very low sensitivity (17.4%, Table 2) into perspective that is observed for the LSMA model at the default cutoff of 0.5. 
After adjustment, values almost identical to those for MC4PC are achieved (MC4PC sensitivity, 30.6%; specificity, 85.8%; Table 2; specificity of LSMA when adjusted to yield the same sensitivity, 86.5%; Table 3). Thus, the discriminatory power of both programs is identical, but the behavior of the LSMA model, when applied as is, is clearly biased


Figure 2. Relative frequencies (%) of DEREK mutagenicity structural alerts (SA) for the three investigated data sets (LSDB, red; Hansen, green; Roche, blue). The x-axis displays the SA number (DEREK numbering). The y-axis is truncated at about 7.5% in order to increase the resolution.

clustering algorithm developed in-house that emphasizes the recognition of larger scaffolds (Maximum Common Substructure, MCS) and common building blocks arranged in similar topology (Maximum Overlapping Set, MOS). Details can be found in the corresponding reference.26 A prediction status (true/false positive/negative) was assigned to each compound for each predictor. Groups of compounds were identified that had at least two valid predictions, of which at least half were false negative predictions. The subgroups thus identified were then checked for prevalent clusters. Subsequently, the whole database (irrespective of prediction status) was searched either by simple substructure search or by more generic SMARTS27 pattern matching, depending on the complexity of the structural motif characterizing the respective cluster. The following descriptions should be seen as a list of some significant examples, giving the reader some insight into the more problematic chemotypes. It is neither a comprehensive enumeration of all challenging classes nor an attempt to thoroughly discuss elaborate structure–toxicity relationships around the identified motifs. Their chemical names are used to describe the characteristics of a cluster; however, it is not implied that the respective class is generally associated with or devoid of mutagenicity. One group of false negatives consists of polycyclic aromatic compounds containing at least three fused aromatic rings. As Table 4 indicates, the occurring sensitivity problem is specific to DEREK. The ability of DEREK to recognize about three-quarters of the Ames positives shows that the class is covered to a large extent; however, since 85% of the class members are experimentally positive, an expansion of the scope of the alert might be appropriate. Another cluster (including 537 compounds) comprising

a group of false negatives is characterized by small size and low complexity (one benzene ring) and bears a hydroxyl, amino, or anilide motif. Only about 40% of the members of this class are Ames positive; an unselective flagging would therefore not be appropriate. The QSAR-based predictors are slightly more sensitive to this class, at an acceptable cost in specificity. The problems associated with this structural class are presumably at least partially related to the known difficulties with aromatic amines. The differential mutagenicity of this class is evidently better represented by a (preferably local) QSAR model than by very complex structural patterns.

Quite a unique class of missed mutagens is represented by a series of Roche compounds bearing a 4-benzylpiperidine moiety. Nearly one-half of the 46 molecules belonging to this class show mutagenic activity in the experiment, yet none of them is predicted to be active by any of the three predictors.

Another cluster enriched in false negative predictions from the QSAR-based tools is characterized by nonaromatic compounds bearing a chlorine, bromine, or iodine substituent on a double-bonded or saturated carbon atom (i.e., small halogenated alkanes or alkenes). Approximately half of the compounds belonging to this very generic class exhibit mutagenic activity. The sensitivity of MC4PC for this group is 42%, whereas LSMA recognizes only about 7%. When analyzing the probabilities of the LSMA output, it is striking that many of these false negatives receive scores closer to the 0.5 threshold than to 0.0. This indicates that the QSAR model assigns “elevated” coefficients to these features, detecting some signal that is, however, in most cases not strong enough to push the probabilities beyond the 0.5 threshold. Of course, an adjustment of the threshold can be helpful

dx.doi.org/10.1021/tx2000398 |Chem. Res. Toxicol. 2011, 24, 843–854


Table 4. Selected Structural Classes Prevalent with False Negative Classifications Given by One or More Predictors

| structural class (representatives)^a | frequency [%]^b | Ames+ [%] | predictor | sensitivity [%] | specificity [%] | accuracy [%] | excluded [%]^c |
| polyaromatic compounds (1, 2) | 11.54 (1,117/9,681) | 84.60 (945/1,117) | DEREK | 77.47 (729/941) | 53.22 (91/171) | 73.74 (820/1,112) | 0.45 (5/1,117) |
| | | | LSMA | 96.63 (315/326) | 21.18 (18/85) | 81.02 (333/411) | 63.21 (706/1,117) |
| | | | MC4PC | 90.92 (471/518) | 39.34 (24/61) | 86.69 (495/571) | 48.16 (538/1,117) |
| phenol/aniline/benzamide (3, 4) | 5.55 (537/9,681) | 39.29 (211/537) | DEREK | 58.54 (120/205) | 85.71 (276/322) | 75.14 (396/527) | 1.86 (10/537) |
| | | | LSMA | 60.27 (44/73) | 79.23 (145/183) | 73.83 (189/256) | 52.33 (281/537) |
| | | | MC4PC | 65.29 (79/121) | 87.96 (95/108) | 75.98 (174/229) | 57.36 (308/537) |
| 4-benzylpiperidines (5) | 0.48 (46/9,681) | 43.48 (20/46) | DEREK | 0.00 | 100.00 (26/26) | 56.52 (26/46) | 0.00 |
| | | | LSMA | 0.00 | 100.00 (24/24) | 55.81 (24/43) | 6.52 (3/46) |
| | | | MC4PC | 0.00 | 100.00 (26/26) | 56.52 (26/46) | 0.00 |
| alkyl/alkenyl halides (6, 7) | 3.22 (312/9,681) | 51.60 (161/312) | DEREK | 90.61 (193/213) | 20.21 (19/94) | 69.06 (212/307) | 1.60 (5/312) |
| | | | LSMA | 6.52 (3/46) | 100.00 (15/15) | 29.51 (18/61) | 80.45 (251/312) |
| | | | MC4PC | 42.25 (30/71) | 85.71 (24/28) | 54.55 (54/99) | 68.27 (213/312) |
| azides (8) | 0.92 (89/9,681) | 94.38 (84/89) | DEREK | 100.00 (84/84) | 0.00 (0/5) | 94.38 (84/89) | 0.00 |
| | | | LSMA | 17.50 (14/80) | 80.00 (4/5) | 21.18 (18/85) | 4.49 (4/89) |
| | | | MC4PC | 30.00 (21/70) | 80.00 (4/5) | 33.33 (25/75) | 15.73 (14/89) |
| octahydrobenzo[f]quinolines (9) | 0.20 (19/9,681) | 73.68 (14/19) | DEREK | 7.14 (1/14) | 100.00 (5/5) | 31.58 (6/19) | 0.00 |
| | | | LSMA | 7.14 (1/14) | 100.00 (5/5) | 31.58 (6/19) | 0.00 |
| | | | MC4PC | 11.11 (1/9) | 50.00 (2/4) | 23.08 (3/13) | 31.58 (6/19) |

^a Numbers in parentheses refer to the structures shown in Chart 1. ^b Percentage of compounds belonging to the respective chemical class within the pooled dataset. ^c Quantifies the amount of compounds within a structural group that are not predicted by the respective software or that were excluded due to training set membership.
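The generic SMARTS searches used to assemble such classes can be sketched with RDKit; the pattern below is a hypothetical stand-in for the alkyl/alkenyl halide class (the patterns actually used in the study are not published):

```python
from rdkit import Chem

# Hypothetical SMARTS for the "alkyl/alkenyl halide" class: Cl, Br, or I
# attached to a saturated (sp3) or double-bonded (sp2) non-aromatic carbon.
halide_pattern = Chem.MolFromSmarts("[CX4,CX3][Cl,Br,I]")

smiles = ["CCBr",        # bromoethane        -> matches
          "ClC=C",       # vinyl chloride     -> matches
          "Clc1ccccc1",  # chlorobenzene      -> aromatic carbon, no match
          "CCO"]         # ethanol            -> no halogen, no match

flagged = [s for s in smiles
           if Chem.MolFromSmiles(s).HasSubstructMatch(halide_pattern)]
print(flagged)  # ['CCBr', 'ClC=C']
```

One such pattern per class, applied over the pooled structure set, yields the class memberships that are counted in the frequency column of Table 4.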

(as demonstrated by Table 3) and leads to some improvement. We return to this example later to explain the reason for this partial toxophore recognition and possible ways to solve the problem (see Conclusions).

Most of the organic azides present in the database are Ames positive (84/89); therefore, a broad, unspecific alert as implemented in DEREK seems a viable approach. Whereas this alert captures all azides in the database, neither LSMA nor MC4PC recognizes even a third of them. Although both have a high specificity compared to DEREK, their overall accuracy for this functional group is low.

A small group of octahydrobenzo[f]quinolines is not detected by any of the predictors. Almost 75% of them are Ames positive, but each program correctly identifies only a single molecule, and not the same one in each case. The one compound that DEREK flags is recognized only because of its bromomethylene group.

Analysis of False Positives. Suboptimal specificity turned out to be an issue of less concern, considering the performance statistics in Table 2. However, for the Hansen dataset this metric reaches very low values for all software packages except MC4PC. For DEREK, the most concise way of analyzing the source of false positive predictions is to record the number of Ames negatives flagged by an alert divided by the number of all compounds flagged by that alert. Table 5 shows the 15 alerts with the highest false positive rates, excluding those occurring in fewer than 10 molecules. Ideally, an alert would flag 100% Ames positive compounds. Alerts with a false positive rate above 58% are particularly suspicious in our study, since a randomly drawn sample would already (on average and for large sample sizes) reach this level: the whole database contains 42.4% Ames positives and thus 57.6% negatives.
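The per-alert counting exercise just described amounts to a few lines of bookkeeping; a minimal sketch with made-up alert labels and outcomes (not the study data):

```python
from collections import defaultdict

# Each record: (alert triggered, experimental Ames result).
# Illustrative data only -- not the values behind Table 5.
records = [
    ("oxime", "neg"), ("oxime", "neg"), ("oxime", "pos"),
    ("allylamine", "neg"), ("allylamine", "pos"), ("allylamine", "pos"),
]

flagged = defaultdict(int)   # all compounds flagged by the alert
fp = defaultdict(int)        # Ames-negative compounds flagged by the alert

for alert, outcome in records:
    flagged[alert] += 1
    if outcome == "neg":
        fp[alert] += 1

# False positive rate per alert = flagged Ames negatives / all flagged.
rates = {a: 100.0 * fp[a] / flagged[a] for a in flagged}
for alert, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{alert}: {rate:.1f}% ({fp[alert]}/{flagged[alert]})")
```

Sorting the resulting rates in descending order and truncating to alerts with at least 10 occurrences reproduces the layout of Table 5.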
However, even alerts with high false positive rates might be justified if they are triggered very rarely and serve as a hint toward potential risks with a certain structural motif. Other alerts might

Chart 1. Representatives of Structural Classes Enriched with False Negative Predictions

be “purer” but compromise the overall specificity through their high absolute occurrence. The top three alerts with the highest absolute frequency of false positives are the aromatic nitro, alkylating agent, and epoxide alerts (15.4%, 28.8%, and 32.9% false positive rates, respectively). Because of their high abundance, polyaromatic compounds also appear as a group contributing a large absolute number of false positive predictions, although the alert itself is quite “pure” (i.e., has a low false positive rate). As pointed out in the previous section, the vast majority of this chemotype is Ames positive; from a pharmaceutical perspective, this group is not of special interest, and thus the unselective setup of this alert appears justified in the intended context of use.

In order to further investigate false positive examples with respect to LSMA and MC4PC, for which a simple counting exercise of alerts is not possible, an analysis similar to that described for the false negatives, using clustering and substructure searching techniques, was carried out. The results are shown in Table 6. More than 80% of aromatic nitro compounds are Ames positive, and therefore a relatively unselective flagging seems an appropriate choice. However, in contrast to the polyaromatic scaffold, some important marketed drugs contain nitro groups,

and also, their occurrence as impurities in active pharmaceutical ingredients is of high interest. Here, MC4PC offers the best discriminative power. Molecules containing a quinoline moiety (without further ring fusions) constitute a challenging case in terms of specificity, especially for the LSMA model but, to some extent, also for DEREK. A small group of 21 Roche compounds with a 2,5-diamino-4-phenylpyridine scaffold is completely inactive. DEREK (one false positive prediction due to a hydroxylamine alert) and the LSMA predictor have no prediction problem within this group; MC4PC, however, flags all of them as positive. A furan ring is contained in 114 molecules in the database, 64 of which contain no nitro group. Only 14% of these compounds are active, and the LSMA model clearly overpredicts them. DEREK and MC4PC behave more selectively, at the price of a decrease in sensitivity.

Agreement between the Applications. In the following, all three pairs of DEREK, LSMA, and MC4PC results are compared for their mutual prediction agreement. The number and identity of compounds differ from pair to pair, since only compounds yielding valid predictions from both predictors under consideration can be used for this type of analysis. Table 7 gives the conditional probabilities for all three possible pairwise comparisons of predictors. The conditional probability of A+ given B+ (denoted P(A+|B+)) is calculated by dividing the number of compounds predicted as positive by predictors A

Table 5. List of the Top 15 DEREK Alerts Sorted by Their False Positive Rates^a

| alert number^b | false positive rate [%] | alert description |
| 049 | 73.8 (31/42) | oxime |
| 341 | 72.7 (8/11) | pyrroline ester, pyrroline N-oxide ester, pyrrole ester or pyrrole alcohol |
| 203 | 70.0 (7/10) | flavonol |
| 476 | 64.5 (20/31) | allylamine |
| 028 | 63.2 (12/19) | mono- or dialkylhydrazine |
| 026 | 61.5 (8/13) | N-haloamine |
| 305 | 61.1 (44/72) | alkyl ester of phosphoric or phosphonic acid |
| 491 | 58.3 (7/12) | N-amino heterocycle |
| 012 | 57.9 (22/38) | aliphatic nitro compound |
| 050 | 57.1 (20/35) | peroxide, peracid or perester |
| 325 | 52.9 (9/17) | aromatic azoxy compound |
| 574 | 50.0 (5/10) | furan derivative |
| 583 | 46.2 (6/13) | benzimidazole |
| 304 | 45.0 (9/20) | isocyanate or isothiocyanate |
| 471 | 40.9 (9/22) | quinolone-3-carboxylic acid or naphthyridine analogue |

^a Only alerts which were triggered by at least 10 molecules have been considered. ^b DEREK numbering system.

Chart 2. Representatives of Structural Classes Enriched with False Positive Predictions

Table 6. Selected Structural Classes Prevalent with False Positive Predictions Given by One or More Predictors

| structural class (representatives)^a | amount in database [%] | Ames+ [%] | predictor | sensitivity [%] | specificity [%] | accuracy [%] | excluded [%]^b |
| nitro aromatic compounds (10) | 11.52 (1018) | 82.61 (841/1018) | DEREK | 99.40 (834/839) | 13.14 (23/175) | 84.52 (857/1014) | 0.39 (4/1018) |
| | | | LSMA | 94.76 (434/458) | 8.73 (11/126) | 76.20 (445/584) | 42.63 (434/1018) |
| | | | MC4PC | 96.43 (810/840) | 37.85 (67/177) | 98.86 (867/877) | 0.10 (1/1018) |
| quinolines (11) | 1.80 (174) | 38.51 (67/174) | DEREK | 82.81 (53/64) | 52.34 (56/107) | 63.74 (109/171) | 1.72 (3/174) |
| | | | LSMA | 93.10 (27/29) | 29.67 (27/91) | 45.00 (54/120) | 31.03 (54/174) |
| | | | MC4PC | 43.75 (14/32) | 75.93 (41/54) | 63.95 (55/86) | 50.57 (88/174) |
| 2,5-diamino-4-phenylpyridines (12) | 0.22 (21) | 0.00 | DEREK | NA | 95.24 (20/21) | 95.24 (20/21) | 0.00 |
| | | | LSMA | NA | 100.00 (2/2) | 100.00 (2/2) | 90.48 (19/21) |
| | | | MC4PC | NA | 0.00 | 0.00 | 0.00 |
| furan w/o nitro (13, 14) | 0.66 (64) | 14.06 (9/64) | DEREK | 55.56 (5/9) | 70.39 (38/54) | 68.25 (43/63) | 1.56 (1/64) |
| | | | LSMA | 85.71 (6/7) | 32.65 (16/49) | 39.29 (22/56) | 12.50 (8/64) |
| | | | MC4PC | 16.67 (1/6) | 83.33 (25/30) | 72.22 (26/36) | 43.75 (28/64) |

^a Numbers in parentheses refer to the structures shown in Chart 2. ^b Quantifies the amount of compounds within a structural group that are not predicted by the respective software or that were excluded due to training set membership.


Table 7. Agreement among DEREK, LSMA, and MC4PC Predictions^a

| data set | software combination (A/B) | P(A+|B+) [%] | P(B+|A+) [%] | P(A-|B-) [%] | P(B-|A-) [%] | total agreement [%] | RIOC [%] | number of compounds |
| LSDB | DEREK/MC4PC | 86.1 | 70.8 | 67.5 | 84.1 | 76.4 | 66.8 | 1,999 |
| LSDB | DEREK/LSMA | 79.0 | 71.5 | 76.5 | 83.0 | 77.6 | 60.3 | 2,288 |
| LSDB | LSMA/MC4PC | 81.0 | 73.2 | 73.7 | 81.4 | 77.1 | 60.4 | 1,333 |
| Hansen | DEREK/MC4PC | 90.2 | 76.3 | 50.4 | 74.4 | 76.0 | 60.8 | 1,095 |
| Hansen | DEREK/LSMA | 82.2 | 70.4 | 54.8 | 68.9 | 69.8 | 45.2 | 1,910 |
| Hansen | LSMA/MC4PC | 85.6 | 81.7 | 68.7 | 74.5 | 79.2 | 58.8 | 816 |
| Roche | DEREK/MC4PC | 27.0 | 36.2 | 90.8 | 86.5 | 80.4 | 23.8 | 2,276 |
| Roche | DEREK/LSMA | 43.0 | 23.0 | 88.3 | 95.0 | 84.8 | 33.6 | 1,694 |
| Roche | LSMA/MC4PC | 19.1 | 39.1 | 94.4 | 86.2 | 82.6 | 27.7 | 1,662 |

^a For each dataset, the conditional probabilities are shown for all three pairwise software combinations (for details, see text). A plus sign (+) denotes a positive prediction and a minus sign (-) an Ames-inactive prediction. Furthermore, the number of compounds cross-classified by both software packages as well as the RIOC metric as an overall measure of agreement is given.
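As an illustration of how the entries of such a table are obtained, the sketch below computes the conditional probabilities and the RIOC metric from paired predictions. The data are invented, and the RIOC formula follows our reading of Copas and Loeber (ref 18): observed agreement relative to chance agreement, scaled by the maximum agreement achievable with the same marginals.

```python
# Paired predictions (predictor A, predictor B) for compounds with valid
# results from both tools; 1 = Ames-positive prediction, 0 = negative.
# Illustrative data only.
pairs = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 15 + [(0, 0)] * 35

n = len(pairs)
both_pos = sum(1 for a, b in pairs if a == 1 and b == 1)
both_neg = sum(1 for a, b in pairs if a == 0 and b == 0)
a_pos = sum(a for a, b in pairs)
b_pos = sum(b for a, b in pairs)

# P(A+|B+): fraction of B's positives that A also calls positive, etc.
p_apos_given_bpos = both_pos / b_pos
p_bpos_given_apos = both_pos / a_pos
p_aneg_given_bneg = both_neg / (n - b_pos)
p_bneg_given_aneg = both_neg / (n - a_pos)
total_agreement = (both_pos + both_neg) / n

# RIOC: (observed - chance) / (maximum - chance), with chance and maximum
# agreement both derived from the marginal totals.
observed = both_pos + both_neg
chance = (a_pos * b_pos + (n - a_pos) * (n - b_pos)) / n
maximum = min(a_pos, b_pos) + min(n - a_pos, n - b_pos)
rioc = 100.0 * (observed - chance) / (maximum - chance)

print(f"P(A+|B+) = {100 * p_apos_given_bpos:.1f}%")
print(f"total agreement = {100 * total_agreement:.1f}%, RIOC = {rioc:.1f}%")
```

With these invented counts the raw agreement is 75%, while the RIOC is noticeably lower, which is exactly the kind of chance correction discussed for the Roche data below.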

and B by the number of compounds predicted positive by predictor B. Thus, it indicates how many of the compounds predicted as positive by model B will also be detected by model A. The definitions for the inverse and for negative predictions are analogous.

In general, the percentage of overall agreement is quite high, ranging from 70% to 85%. However, as for the comparison with the experimental outcomes, chance effects have to be taken into account when the marginal distributions are strongly skewed. Especially for the Roche data, a high chance contribution to the agreement is revealed by a low RIOC metric compared to the actual overall agreement rate. For the LSDB and Hansen data, DEREK detects more of the compounds predicted as positive by the QSAR tools than the other way around; the opposite trend is found for agreement on negative predictions. DEREK predicts only about half of the Hansen compounds predicted by LSMA or MC4PC as inactive. For the Roche data, the concordance in terms of positive predictions is quite low, between 19.1% and 43.0%; the high overall agreement for this data set originates almost exclusively from agreement on negative predictions.

Combined Use of Predictors. Combining the predictions of more than one tool is an option to merge the capabilities of several classifiers and has also been suggested for mutagenicity prediction.28,29 The agreement analysis shows that the programs differ to a significant extent in their actual predictions; thus, a combination of them might lead to better performance. The two simplest conceivable ways to combine two or more classifiers are a logical OR- or AND-combination of the individual predictions. An AND-combination (a positive combined prediction is assigned only when all individual predictors give a positive prediction) would reduce the coverage, since all tools must return a valid prediction.
Furthermore, this kind of combination would increase specificity and reduce sensitivity, because it becomes harder to achieve a positive prediction. As mentioned above, sensitivity is the major bottleneck for the performance; therefore, only the OR-combination was investigated. A combined positive prediction is assigned if any single prediction is positive, irrespective of the fact that one or more tools might not contribute a vote. From the very nature of this approach, it is expected that sensitivity and coverage will increase and specificity will decrease. No boost of discrimination power can be expected; therefore, the

results have to be judged in terms of overall accuracy and RIOC improvement, and the resulting performance profile has to be checked against the particular needs of the desired application. Considering the results shown in Table 8, it is interesting to see that the RIOC metric takes on its highest value for the triple combination and its lowest one for the LSMA/MC4PC combination. This correlates with the sensitivity and is not surprising, taking the results for the single predictors into account. The overall accuracy remains virtually constant for all datasets when the best results for the single predictors are compared to the best possible combinations. For the Roche dataset, this approach slightly improves the sensitivity without compromising the specificity to unacceptably low levels.

Relationship to Other Publications. Several publications describe the performance assessment of computational toxicology tools for bacterial mutagenicity prediction. This section briefly puts our results in the context of other publications, without any intention of completely reviewing the existing literature on the subject. Cariello et al.30 report an assessment of DEREK (version number incompatible with the current system used; the paper is from 2002) on 409 proprietary pharmaceutical compounds (GlaxoSmithKline) containing 20% Ames positives (compared to 12.9% for the Roche data). They observed a very similar sensitivity (46.3%) but a much worse specificity (69.1%), resulting in an overall diminished accuracy (64.5%; the recalculated RIOC metric is 18.7%, almost half the value found in our study). A study by Snyder et al.28,31 examined the prediction of 525 marketed drugs (7.1% Ames positives) using DEREK (versions 5.0 and 10.0) and MCASE/MC4PC (modules A2I/A2H, versions 3.46/2.0). The accuracy of DEREK was similar, but the sensitivity was clearly higher (61.5%) compared to our results obtained on the Roche data.
They also recorded a higher sensitivity (44.7%) and specificity (97.1%) for MCASE/MC4PC. However, it needs to be considered that no claim was made that compounds contained in the training sets were excluded from the performance assessment in their study, which might lead to an optimistic bias. This publication also reports the effects of an OR-combination of both programs, which resulted in a considerable improvement in sensitivity, up to 76.9%, but at the expense of specificity, which is not


Table 8. Performance Statistics of the Four Possible OR-Combinations of DEREK (D), LSMA (L), and MC4PC (M) Models for the Three Data Sources^a

| data set | software combination (OR) | accuracy [%] | sensitivity [%] | specificity [%] | RIOC [%] | number of compounds^b | not processed [%]^c | ratio neg/pos |
| LSDB | D/L | 75.8 | 77.3 | 74.7 | 56.4 | 4,651 | 1.02 | 1.29 |
| LSDB | D/M | 76.7 | 75.3 | 77.7 | 54.8 | 4,645 | 1.15 | 1.29 |
| LSDB | L/M | 73.0 | 66.7 | 78.7 | 50.5 | 2,755 | 41.37 | 1.09 |
| LSDB | D/L/M | 76.3 | 78.6 | 74.5 | 58.3 | 4,655 | 0.94 | 1.29 |
| Hansen | D/L | 76.0 | 88.2 | 51.1 | 52.5 | 2,642 | 0.23 | 0.49 |
| Hansen | D/M | 74.7 | 82.7 | 58.4 | 44.1 | 2,632 | 0.57 | 0.50 |
| Hansen | L/M | 68.8 | 70.7 | 65.1 | 40.3 | 2,111 | 20.25 | 0.52 |
| Hansen | D/L/M | 76.3 | 88.9 | 50.7 | 54.2 | 2,641 | 0.23 | 0.49 |
| Roche | D/L | 83.4 | 47.5 | 88.7 | 37.5 | 2,333 | 0.09 | 6.78 |
| Roche | D/M | 81.5 | 45.8 | 86.7 | 34.4 | 2,335 | 0.00 | 6.76 |
| Roche | L/M | 81.6 | 21.7 | 90.7 | 13.5 | 2,282 | 2.27 | 6.87 |
| Roche | D/L/M | 79.8 | 50.5 | 84.1 | 37.9 | 2,335 | 0.00 | 6.76 |

^a The best metric is marked in bold font. ^b Total number of compounds used in the statistics (see text for details). ^c Percentage of input molecules not processed for technical reasons or for being considered out of domain (only cases for which none of the combined tools gives a valid prediction). This corresponds only to a measure of coverage for the L/M combination.
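A minimal sketch of the OR-combination evaluated in Table 8 (illustrative data; None stands for an invalid or out-of-domain prediction):

```python
def or_combine(votes):
    """OR-combination: positive if any valid vote is positive; None if no
    tool returned a valid prediction (compound not covered)."""
    valid = [v for v in votes if v is not None]
    if not valid:
        return None
    return 1 if any(v == 1 for v in valid) else 0

# (DEREK, LSMA, MC4PC) votes and experimental Ames outcome; invented data.
data = [
    ((1, None, 0), 1),        # one positive vote -> combined positive
    ((0, 0, None), 0),
    ((None, None, None), 1),  # no valid vote -> compound not covered
    ((0, 1, 1), 1),
    ((0, 0, 0), 1),           # combined false negative
]

tp = fp = tn = fn = skipped = 0
for votes, ames in data:
    pred = or_combine(votes)
    if pred is None:
        skipped += 1
    elif pred == 1 and ames == 1:
        tp += 1
    elif pred == 1 and ames == 0:
        fp += 1
    elif pred == 0 and ames == 0:
        tn += 1
    else:
        fn += 1

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(tp, fp, tn, fn, skipped)  # 2 0 1 1 1
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

The AND-combination mentioned in the text differs only in requiring all votes to be valid and positive, which is why it trades coverage and sensitivity for specificity.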

further indicated. The publication by Greene et al.32 contains a very informative evaluation of DEREK v4.01 and MultiCASE v3.45 applied to 974 compounds of Pfizer’s proprietary Ames test database (9% positives). Their findings on MultiCASE’s performance are almost identical to ours, whereas for DEREK only the sensitivity is very similar (45%); the specificity (62%), and thus the accuracy (60%), are much lower than in our study, in line with the results from Cariello’s evaluation. This indicates that the sensitivity on truly novel compounds has remained almost unchanged over several versions of DEREK, whereas the specificity has strongly improved, judging from our study.

A very recent review33 by the Pfizer group compares the performance of DEREK and the ACT/Tox suite for mutagenicity prediction on the full Hansen dataset and two proprietary sets. Their ROC graph shows a satisfactory performance of DEREK for this dataset (sensitivity, about 77%; specificity, about 73%), with the trade-off shifted more toward higher sensitivities compared to the results we obtained for the subset containing only approximately 40% of the full set. On two external datasets, they observed similarly high specificities (about 95% and 92%, respectively) but significantly lower sensitivities for DEREK (about 17% and 27%) than we did on the Roche data. Thus, the trend strongly confirms our findings. Unfortunately, neither the exact DEREK version number (the paper is from 2010) nor details about the size and composition of the external datasets are given. White et al.29 applied DEREK (v4.01) and CASETOX (providing the same predictions as MC4PC if pretrained modules are used; module A2I v.3.43) to two test sets consisting of 520 Pharmacia in-house and 94 commercially available compounds (14.6% and 31.9% Ames positives, respectively), which were not part of the training set of CASETOX.
For DEREK, they observed even lower sensitivities (28%) and specificities (80%) on their in-house data compared to our results. Better performance was found for the commercial dataset (sensitivity, 63%; specificity, 81%). For CASETOX, they obtained a much higher sensitivity (50%) and the same specificity on the Pharmacia data as we did on the Roche data. However, their application led to a strongly reduced coverage, with

59% of the compounds not predicted. In our analysis, we had to reject less than 2% of the Roche compounds due to ambiguous output of MC4PC. The authors also describe their results with a combination of the tools, requiring valid predictions from, and full agreement among, all of them. The specificity improved slightly compared to DEREK and CASETOX as single predictors, but for the proprietary data the sensitivity remained almost as low as with DEREK alone. At the same time, the coverage was reduced to an unacceptably low level, below one-third of all compounds.

An extensive investigation on public Ames data was recently conducted by Hansen et al.4 using 6,512 compounds (54% Ames positives), part of which went into our study. Their focus was more on using DEREK (v10.0.2) and MultiCASE (AZ2 module, v2.1) as benchmarks for other trainable classification algorithms. For the subset of their data that went into our evaluation, we observed somewhat better results, although within a similar range.

’ CONCLUSIONS This article emphasizes the distinction between public and proprietary pharmaceutical Ames test mutagenicity data in terms of chemotype composition and activity distribution. These discrepancies are directly reflected in the performance statistics of the four investigated programs. The performance statistic most in need of improvement was the sensitivity on the Roche data. This can be explained by the peculiar scaffolds that are not represented in any public data set and therefore cannot be recognized by the investigated software packages; this limitation is, to some extent, intrinsic to the fragment-based molecular description they employ. Nevertheless, it is notable that DEREK, despite its algorithmic simplicity, performed best in terms of a sound sensitivity/specificity trade-off. Surprisingly, the QSAR-based predictors, LSMA and MC4PC, missed some well-known mutagens that could easily be detected by a low-complexity structural alert. This deficiency is partially caused by the use of predefined fragment dictionaries (Leadscope) or fragmentation algorithms unaware of chemically reasonable cutting points (MC4PC). This is the trade-off when fully automated data mining procedures are applied in order to save the user the tedious and time-consuming work of manually going through hundreds of molecules in an attempt to find comprehensive rules. As shown for the alkyl/alkenyl halide problem, neglecting the chemical environment of this extremely small and generic group dilutes its signal, averaging the associated QSAR weights over a variety of different electronic and global molecular contexts.

We believe that a true fusion of empirical expert knowledge, as represented in DEREK, with QSAR-based techniques could lead to significant improvements. One conceivable option is to use known structural alerts to guide the fragmentation process for subsequent QSAR analyses. In this way, the mutagenic activity to be modeled would be partitioned onto toxicologically more meaningful fragments. Such an approach can be seen as a way to derive local QSARs (as the Benigni-Bossa rules are, for example). As useful as the concept of structural alerts is, it is still impossible to explain the differential activities of aromatic amines and other metabolically activated chemical classes by substructure patterns alone. Further research is necessary to gain a deeper understanding of chemical reactivity and metabolic activation processes.34 Hence, descriptors of chemical reactivity and global molecular properties have to be used complementarily to the fragment patterns if improvements are to be achieved that go beyond adding newly identified groups to the alert dictionary.

Another important and still unsolved problem is the definition of the applicability domain35,36 of QSAR models. The QSAR tools applied in this study rejected a significant number of compounds due to their distance from the training space.
While the implemented metrics make sense from a mathematical or chemoinformatic point of view as measures of extrapolation (e.g., chemical similarity to the nearest neighbors or to the centroid of the training space), their connection to predictivity is less clear. Additionally, it is important to develop applicability domain measures for expert systems in order to resolve the problem mentioned above, namely that the lack of an alert does not translate into a negative prediction.

The use of in silico tools for the prediction of genotoxic potential has gained great importance in recent years in light of the new guidelines and regulations for the assessment of potential genotoxic impurities in pharmaceuticals. The results of our study show that the combined use of an expert system like DEREK with a QSAR-based technique like Leadscope Model Applier or MC4PC can lead to improvements over the use of a single prediction tool, although the major contribution to the predictive power comes from the expert system.

In summary, the investigated tools for Ames test mutagenicity prediction provide valuable instruments to support expert chemists and toxicologists in identifying mutagens and in prioritizing testing, but the user has to be aware of their particular scopes and limitations. Because of the large number and high quality of Ames test data, we are convinced that further improvements in this area are possible.

ARTICLE

’ REFERENCES
(1) Ames, B. N., Lee, F. D., and Durston, W. E. (1973) An improved bacterial test system for the detection and classification of mutagens and carcinogens. Proc. Natl. Acad. Sci. U.S.A. 70, 782–786.
(2) Kado, N. Y., Langley, D., and Eisenstadt, E. (1983) A simple modification of the Salmonella liquid-incubation assay. Increased sensitivity for detecting mutagens in human urine. Mutat. Res. 121, 25–32.
(3) Leadscope Inc., FDA SAR Genetox Database Documentation. http://www.leadscope.com/ls-manuals/downloads/FDAToxicityDatabaseDocumentation.pdf (accessed November 21, 2009).
(4) Hansen, K., Mika, S., Schroeter, T., Sutter, A., ter Laak, A., Steger-Hartmann, T., Heinrich, N., and Muller, K. R. (2009) Benchmark data set for in silico prediction of Ames mutagenicity. J. Chem. Inf. Model. 49, 2077–2081.
(5) Sanderson, D. M., and Earnshaw, C. G. (1991) Computer prediction of possible toxic action from chemical structure; the DEREK system. Hum. Exp. Toxicol. 10, 261–273.
(6) Greene, N., Judson, P. N., Langowski, J. J., and Marchant, C. A. (1999) Knowledge-based expert systems for toxicity and metabolism prediction: DEREK, StAR and METEOR. SAR QSAR Environ. Res. 10, 299–314.
(7) Judson, P. N. (2006) Using Computer Reasoning about Qualitative and Quantitative Information to Predict Metabolism and Toxicity, in Pharmacokinetic Profiling in Drug Research: Biological, Physicochemical, and Computational Strategies (Testa, B., Kramer, S. D., Wunderli-Allespach, H., and Volkers, G., Eds.) pp 183–215, Wiley, New York.
(8) DEREK for Windows, version 10.0.2, Lhasa Ltd., Leeds, UK. http://www.lhasalimited.org (accessed Dec 16, 2010).
(9) European Commission Joint Research Centre Computational Toxicology Group, Toxtree, version 1.60. http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/index.php?c=TOXTREE (accessed Dec 16, 2010).
(10) Lahl, U., and Gundert-Remy, U. (2008) The use of (Q)SAR methods in the context of REACH. Toxicol. Mech. Methods 18, 149–158.
(11) Pavan, M., and Worth, A. P. (2008) Publicly-accessible QSAR software tools developed by the Joint Research Centre. SAR QSAR Environ. Res. 19, 785–799.
(12) Benigni, R., and Bossa, C. (2008) Structure alerts for carcinogenicity, and the Salmonella assay system: a novel insight through the chemical relational databases technology. Mutat. Res. 659, 248–261.
(13) MultiCASE, version 2.1.0.99, module A2H, MultiCASE Inc., Beachwood, OH. http://www.multicase.com (accessed Dec 16, 2010).
(14) Klopman, G. (1984) Artificial intelligence approach to structure-activity studies. Computer automated structure evaluation of biological activity of organic molecules. J. Am. Chem. Soc. 106, 7315–7321.
(15) Klopman, G. (1992) MULTICASE 1. A hierarchical computer automated structure evaluation program. Quant. Struct.-Act. Relat. 11, 176–184.
(16) Leadscope Model Applier, version 1.2, Salmonella model, Leadscope Inc., Columbus, OH. http://www.leadscope.com (accessed Dec 16, 2010).
(17) Witten, I. H., and Frank, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, San Francisco, CA.
(18) Copas, J. B., and Loeber, R. (1990) Relative improvement over chance (RIOC) in 2 × 2 tables. Br. J. Math. Stat. Psychol. 43, 293–307.
(19) Cohen, J. (1960) A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46.
(20) Lantz, C. A., and Nebenzahl, E. (1996) Behavior and interpretation of the kappa statistic: resolution of the two paradoxes. J. Clin. Epidemiol. 49, 431–434.
(21) Bortz, J. (2005) Statistik für Human- und Sozialwissenschaftler, 6th ed., Springer, Berlin, Germany.
(22) Fawcett, T. (2003) ROC graphs: notes and practical considerations for data mining researchers. HP Tech. Rep. 27.
(23) Sing, T., Sander, O., Beerenwinkel, N., and Lengauer, T. ROCR: Visualizing the Performance of Scoring Classifiers, version 1.04. http://rocr.bioinf.mpi-sb.mpg.de/ (accessed Oct 21, 2009).

’ AUTHOR INFORMATION Corresponding Author

*F. Hoffmann-La Roche Ltd., Non-Clinical Safety, Structure Property Effect Relationships Group, Grenzacherstrasse 124, CH-4070 Basel, Switzerland. E-mail: alexander.hillebrecht@roche.com.

’ ACKNOWLEDGMENT James Zimmerman is acknowledged for reviewing the manuscript.


(24) R Development Core Team. R: A Language and Environment for Statistical Computing, version 2.9.0, Vienna, Austria. http://www.R-project.org (accessed Dec 16, 2010).
(25) Pipeline Pilot Professional Client, version 7.5.2.300, Accelrys, Inc., San Diego, CA. http://accelrys.com/products/pipeline-pilot/ (accessed Dec 16, 2010).
(26) Stahl, M., Mauser, H., Tsui, M., and Taylor, N. R. (2005) A robust clustering method for chemical structures. J. Med. Chem. 48, 4358–4366.
(27) Daylight Chemical Information Systems, Inc. SMARTS: A language for describing molecular patterns. http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed Jul 07, 2010).
(28) Snyder, R. D. (2009) An update on the genotoxicity and carcinogenicity of marketed pharmaceuticals with reference to in silico predictivity. Environ. Mol. Mutagen. 50, 435–450.
(29) White, A. C., Mueller, R. A., Gallavan, R. H., Aaron, S., and Wilson, A. G. (2003) A multiple in silico program approach for the prediction of mutagenicity from chemical structure. Mutat. Res. 539, 77–89.
(30) Cariello, N. F., Wilson, J. D., Britt, B. H., Wedd, D. J., Burlinson, B., and Gombar, V. (2002) Comparison of the computer programs DEREK and TOPKAT to predict bacterial mutagenicity. Mutagenesis 17, 321–329.
(31) Snyder, R. D., Pearl, G. S., Mandakas, G., Choy, W. N., Goodsaid, F., and Rosenblum, I. Y. (2004) Assessment of the sensitivity of the computational programs DEREK, TOPKAT, and MCASE in the prediction of the genotoxicity of pharmaceutical molecules. Environ. Mol. Mutagen. 43, 143–158.
(32) Greene, N. (2002) Computer systems for the prediction of toxicity: an update. Adv. Drug Delivery Rev. 54, 417–431.
(33) Naven, R. T., Louise-May, S., and Greene, N. (2010) The computational prediction of genotoxicity. Expert Opin. Drug Metab. Toxicol. 6, 797–807.
(34) Leach, A. G., Cann, R., and Tomasi, S. (2009) Reaction energies computed with density functional theory correspond with a whole organism effect; modelling the Ames test for mutagenicity. Chem. Commun. 1094–1096.
(35) Netzeva, T. I., Worth, A., Aldenberg, T., Benigni, R., Cronin, M. T., Gramatica, P., Jaworska, J. S., Kahn, S., Klopman, G., Marchant, C. A., Myatt, G., Nikolova-Jeliazkova, N., Patlewicz, G. Y., Perkins, R., Roberts, D., Schultz, T., Stanton, D. W., van de Sandt, J. J., Tong, W., Veith, G., and Yang, C. (2005) Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. The report and recommendations of ECVAM Workshop 52. Altern. Lab. Anim. 33, 155–173.
(36) Weaver, S., and Gleeson, M. P. (2008) The importance of the domain of applicability in QSAR modeling. J. Mol. Graphics Modell. 26, 1315–1326.

