
Assessment of Methods To Define the Applicability Domain of Structural Alert Models

C. M. Ellison,*,† R. Sherhod,‡ M. T. D. Cronin,† S. J. Enoch,† J. C. Madden,† and P. N. Judson§

†School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Byrom Street, Liverpool L3 3AF, England
‡Department of Information Studies, University of Sheffield, Regent Court, Sheffield S1 4DP, England
§Lhasa Limited, 22-23 Blenheim Terrace, Woodhouse Lane, Leeds LS2 9HD, England

ABSTRACT: It is important that in silico models for use in chemical safety legislation, such as REACH, are compliant with the OECD Principles for the Validation of (Q)SARs. Structural alert models can be useful under these circumstances but lack an adequately defined applicability domain. This paper examines several methods of domain definition for structural alert models with the aim of assessing which were the most useful. Specifically, these methods were the use of fragments, chemical descriptor ranges, structural similarity, and specific applicability domain definition software. Structural alerts for mutagenicity in Derek for Windows (DfW) were used as examples, and Ames test data were used to define and test the domain of chemical space in which the alerts produce reliable results. The usefulness of each domain was assessed on the criterion that confidence in the correctness of predictions should be greater inside the domain than outside it. By using a combination of structural similarity and chemical fragments, a domain was produced in which the majority of correct positive predictions for mutagenicity fell within the domain and a large proportion of the incorrect positive predictions fell outside it. However, this was not found for the negative predictions; there was little difference between the percentages of true and false predictions for inactivity found within and outside the applicability domain. A hypothesis for this difference between positive and negative predictions is that differences in structure between training and test compounds are more likely to remove the toxic potential of a compound containing a structural alert than to add an unknown mechanism of action (structural alert) to a molecule which does not already contain one. This could be especially true for well-studied end points such as the Ames assay, where the majority of mechanisms of action are likely to be known.

Received: March 11, 2010
Published: April 13, 2011

■ INTRODUCTION
Structural alerts are in silico tools that can be used to predict potential toxicity directly from the presence of a particular (sub)structure. They are useful and powerful tools in predicting toxicity due to their transparency and potential mechanistic basis.1–3 The use of in silico models and, in particular, structural alerts is most appropriate when query compounds are within the applicability domain of the model.4 With regard to in silico toxicology, the applicability domain of a model can be defined as "the response and chemical structure space in which the model makes predictions with a given reliability".5 The purpose of the applicability domain is to define the theoretical space in which a model can make reliable predictions. It provides the user with information which helps them decide whether the model is appropriate for making a prediction for a particular compound, thus reducing model misuse. Defining the applicability domain is commonly achieved by using the information available from the training chemicals. The definition, characterization, and use of applicability domains have been the focus of much recent research due to their emphasis within the OECD Principles for the Validation of (Q)SARs.5–11

Structural alert models often fulfill only four of the five OECD principles.1,4,7,12–15 It is important that structural alert models are fully characterized and evaluated if they are to be of use in the context of chemical safety legislation (e.g., REACH).7 There are particular difficulties with assigning an applicability domain to structural alerts, and this domain is the subject of this paper. Numerical relationships between calculated or measured chemical descriptors and measured toxicity values from a training set of compounds are the basis of many applicability domain definition techniques.5,8,11,16,17 Applicability domains for structural alerts have not been defined using these techniques because the alerts are not based on this type of numerical data. Instead, they are often a combination of structural information, toxic or nontoxic testing outcomes, and expert knowledge, used to directly link substructures with potential activity. The use of expert knowledge causes further problems when attempting to define the applicability domain of structural alerts.


Figure 1. How the applicability domain can be defined in a useful manner.

It can be argued that it is almost impossible to define the entire training set which is "stored" in experts' memories. This lack of a clear training set makes it difficult to define the domain. Ellison et al.18 discussed these problems and attempted to define the applicability domain of structural alerts using a set of structurally fragmented training data selected from the literature. The theory applied by Ellison et al., explaining how fragments can be used to define the domain, is summarized in Figure 1. Although the method presented by Ellison et al. showed some potential, it did not result in an applicability domain with practical applications. Ellison et al. found that the majority of compounds that were correctly predicted as active or inactive by the structural alert model were nevertheless outside of the applicability domain. Considering this, it would be advantageous to find a method to define the applicability domain of structural alerts that is based on structural information and results in a practical applicability domain. One aspect of the domain definition process for structural models, such as Derek for Windows (DfW), that was not raised by Ellison et al.18 is the fact that an outcome such as "nothing to report", if strictly interpreted, is always within the domain of the model. The outcome of "nothing to report" from DfW means that the program has been unable to find a structural alert in the structure of a compound, and there are no other rules in the knowledge base to suggest either activity or inactivity. If the domain is defined as the response and chemical space in which a model can make a prediction with a given reliability, then the reliability of the outcome "nothing to report" is thus 100%; the user can be 100% sure that the program knows nothing about the toxic potential of that compound. However, the assessment of "nothing to report" is of little use, and for well-studied end points, such as skin sensitization and mutagenicity, it is often taken as a prediction of inactivity.19,20 In essence, the rule "if no alerts are fired, then activity is improbable" is being assumed. No such rule has been included in structural alert models, including DfW, because of concerns regarding the rule's validity. However, if an applicability domain can be defined for such a rule, it may be possible to incorporate the rule into programs such as DfW.

This could increase the predictive range of the program and avoid the weaknesses associated with current user inference. If the outcome of "nothing to report" is used in this manner, then the "prediction" is no longer automatically within the applicability domain. To enable the definition of a domain which would facilitate the development of a rule telling the user that no alerts have been found in a compound and that toxicity is therefore improbable, the domain has to be defined using the "nothing to report" outcome from DfW as a prediction of inactivity. DfW only predicts inactivity (doubted, improbable, or impossible) in two situations: (i) when a molecule meets the criteria of specific rules which indicate that toxicity is unlikely (e.g., DfW contains a rule which states that for any compound with a molecular weight under 94, oestrogenicity is improbable); (ii) when a compound is contained in the knowledge base and has been shown to be inactive experimentally. Thus, negative predictions do not occur very often. As noted above, if a system has very good coverage of the alerts that lead to a given end point (which is thought to be the case for Ames test mutagens in DfW), then the absence of any alert from a query structure is an argument for the structure to be inactive. The argument is not as strong as one based on a direct rule for inactivity, but this can be taken into account in a reasoning-based system that is designed to work with uncertain evidence. In this study, the rule was applied to facilitate the definition of a domain which was fit for purpose. There are currently several different possible methods for defining the applicability domain of predictive models. However, there have been few investigations into how well any of these methods perform for structural alerts and whether they result in applicability domains with practical applications. The aim of this study was to assess several methods for defining the applicability domain of structural alerts, specifically examining whether any method resulted in a more practical domain than the others. The study incorporated compounds which received a "nothing to report" outcome from DfW. If experimentally inactive compounds not containing alerts fell within the resultant domain and active compounds not containing an alert fell outside of it, this would be the first step in providing support for incorporating a rule of the form "if no alert exists then activity is improbable" into knowledge-based systems.
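To make the proposed rule concrete, the following sketch gates the inference on a domain check. This is a hypothetical illustration only: neither the function nor its outcome strings are part of DfW, and the domain test is assumed to be supplied separately.

```python
# Hypothetical sketch of the proposed rule: treat the absence of alerts as
# a prediction of inactivity only when the query falls inside the
# applicability domain. No such function exists in DfW; it merely
# illustrates the reasoning described above.
def predict(fired_alerts: list[str], in_domain: bool) -> str:
    if fired_alerts:
        return "plausible"         # an alert fired: positive prediction
    if in_domain:
        return "improbable"        # no alert, in-domain: proposed negative
    return "nothing to report"     # no alert, outside domain: stay agnostic

print(predict([], in_domain=True))                    # improbable
print(predict(["aromatic nitro"], in_domain=False))   # plausible
```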


Table 1. Rules from Derek for Windows (DfW) Which Are Associated with Chemical Descriptors

end point                       | chemical descriptor involved                                                                                 | rule
alpha-2-mu-globulin nephropathy | molecular weight (MW)                                                                                        | if MW > 350: toxicity in rats is doubted; toxicity in other mammals is impossible
oestrogenicity                  | molecular weight (MW)                                                                                        | if MW > 1000: toxicity in mammals is improbable; if MW < 94: toxicity in mammals is improbable
photoallergenicity              | partition coefficient (log P), used in the Potts and Guy equation to calculate the skin permeability coefficient (log Kp) | if log Kp < -5: toxicity in mammals is improbable
skin sensitization              | partition coefficient (log P), used in the Potts and Guy equation to calculate the skin permeability coefficient (log Kp) | if log Kp < -5: toxicity in mammals is improbable

Three general methods for defining the domain were examined: fragments, chemical descriptor ranges, and similarity, along with the AMBIT Discovery software (specialist software for defining the applicability domain). Each approach is discussed in detail in the Methods section. The study was based on the hypothesis that the most useful domain-defining technique would be one that resulted in a domain mostly populated with correct predictions. DfW was used as an example of a model that uses structural alerts, and Ames data from the literature21 were used to define and test the applicability domains produced by each method.

■ METHODS
Structural Alert Model for Toxicity Prediction. Derek for Windows (DfW) (Version 11) is a knowledge-based predictive toxicology program developed by Lhasa Limited (Leeds, UK). The software predicts whether a compound is likely to cause a specific toxic end point using structural alerts and expert rules. The structural alerts are based either on hypotheses relating to mechanisms of action of a chemical class or on observed empirical relationships. DfW currently only covers end points of interest to human health, but predictions can be made for several different species. The likelihood that a chemical will cause toxicity if it contains a structural alert is based on the species in question, as well as rules associated with specific chemical descriptors (Table 1).19 Depending on these factors, i.e., the species in question, the presence of a structural alert, and bioavailability, DfW will give one of nine possibilities for the predicted toxicity of a chemical (certain, probable, plausible, equivocal, doubted, improbable, impossible, contradicted, or open). If the compound does not contain any structural alerts and there is no reason based on the physical properties of the compound to predict inactivity, the position is open and the program will return "nothing to report". For this analysis, predictions of certain, probable, and plausible were taken as positive predictions for mutagenicity, whereas predictions of doubted, improbable, impossible, and nothing to report were taken as predictions of inactivity. Compounds that received an equivocal prediction were removed from the analysis.
Toxicity Data. The data used in this analysis consisted of the 4337 compounds collated by Kazius et al.21 The data are a collection of Ames test results from several sources. A compound was categorized as a mutagen if at least one Ames test result was positive, whereas a compound was categorized as a nonmutagen if exclusively negative Ames test results (one or more) were reported.21 The data consist of 2401 compounds classified as mutagens and 1936 nonmutagens. The data were stored in both SMILES and SDfile format, as DfW and the programs used to define the applicability domain required different input formats. The data were entered into DfW in SDfile format and specifically assessed for the presence of mutagenicity alerts. The species selected for analysis was bacteria, which was considered the most relevant species given that the toxicity data were produced by the Ames assay. Once predictions for the mutagenic potential of the compounds in the data had been collected from DfW, the compounds fell into one of four categories:
• True positives: compounds that showed mutagenic activity in the Ames assay and were predicted as mutagenic by DfW;
• False positives: compounds that showed no mutagenic activity in the Ames assay but were predicted as mutagenic by DfW;
• True negatives: compounds that showed no mutagenic activity in the Ames assay and were not predicted to be mutagenic by DfW;
• False negatives: compounds that showed mutagenic activity in the Ames assay but were not predicted to be mutagenic by DfW.
To define and test the domain of applicability, the data were split into training and test sets. The groups of compounds termed training and test sets in this paper are not "training" and "test" sets in the traditional QSAR sense;22 i.e., the training data have not been used to build a model, and the test data have not been used to validate it. In this study the term "training set" refers to the compounds used to define the domain, and the term "test set" refers to compounds used to test whether model performance improves within the domain. The test set comprised all the false predictions (active compounds predicted inactive and inactive compounds predicted active), 10% of the true positive predictions, and 10% of the true negative predictions. The 10% of true positive and negative predictions were selected from the full data set at random: random numbers were assigned to each compound, and the 10% with the highest values were selected as test data, using the data analysis platform Pipeline Pilot.23 The remaining true predictions were used as training data.
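The outcome mapping and data split described above can be summarized in a short sketch. This is illustrative only: the study performed the random selection in Pipeline Pilot, and the record fields ("outcome", "mutagen", "cat") are hypothetical.

```python
# Illustrative sketch of the categorization and train/test split described
# above; the record fields used here are hypothetical stand-ins.
import random

POSITIVE = {"certain", "probable", "plausible"}
NEGATIVE = {"doubted", "improbable", "impossible", "nothing to report"}

def categorize(outcome: str, mutagen: bool) -> str:
    if outcome in POSITIVE:
        return "TP" if mutagen else "FP"
    if outcome in NEGATIVE:
        return "FN" if mutagen else "TN"
    return "equivocal"                      # removed from the analysis

def split(compounds: list[dict], seed: int) -> tuple[list, list]:
    """All false predictions go to the test set, plus a random 10% of the
    true positives and 10% of the true negatives; the rest is training."""
    rng = random.Random(seed)               # one seed per array (ten arrays)
    test = [c for c in compounds if c["cat"] in ("FP", "FN")]
    train = []
    for cat in ("TP", "TN"):
        trues = [c for c in compounds if c["cat"] == cat]
        picked = set(map(id, rng.sample(trues, k=len(trues) // 10)))
        test += [c for c in trues if id(c) in picked]
        train += [c for c in trues if id(c) not in picked]
    return train, test
```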



The training data were used to define the domain, and the test data were used to assess the usefulness of the defined domain. Only 10% of the true predictions were used as test data in order to retain a large amount of training data with which to define the domain. To assess the applicability domain fully, ten sets of training and test data were created from the full data set by selecting a different 10% of the true predictions to form the test data each time. The test data were selected from the full range of true predictions using the random data selection node in Pipeline Pilot. Having ten sets of training and test data enabled the analyses discussed below to be repeated ten times, preventing the conclusions drawn from the defined domains from being too dependent on any one specific training set. For example, if there were a unique category of compounds in the data and none of them were found in the training data, then they would all be likely to be classed as outside of the applicability domain. However, if one or two of them were found in the training set, then the remaining compounds in the unique category would be classified as within the domain. Repeating the analysis ten times was intended to give the best coverage of the data. To differentiate between the results produced from each of the ten sets of training and test data, they are referred to as Array 1–Array 10. In this analysis the data were not split according to the alert that DfW fired; the applicability domain was defined for the mutagenicity end point as a whole and not for specific alerts. Each alert has its own applicability domain for positive predictions, but there is no meaningful applicability domain for a single alert for negative predictions. Individual alerts do not predict inactivity, they only predict activity; if an alert does not fire, one of many others might. The positive predictivity domain for an alert will depend, among other things, on the mechanism of action and will thus be particular to the alert. The positive predictivity domain for an end point in a given computer system will be the union of the positive predictivity domains of all the relevant alerts. The negative predictivity domain for the end point, as commented earlier, is actually the predictivity domain for the rule "if no alert exists then activity is improbable". Treating the absence of positive predictions from all alerts (i.e., "nothing to report") as a negative prediction in this study thus provides a means of testing the validity of that rule. It is worth noting at this stage that it was not the aim of this study to assess the performance of DfW on the Ames data. Because of the way the test data were selected, any performance statistics that were calculated would be poor: the test data include all the compounds DfW predicted incorrectly and only 10% of the compounds it predicted correctly. Therefore these statistics are not given in this analysis.
Methods of Defining the Applicability Domain. There are many methods reported in the literature for defining the applicability domain of (Q)SAR models,5 but few, if any, are suitable for structural alert models. In this analysis four different approaches were applied to the data described above; each is discussed below.
Fragment Based Approach. The theory of using fragments to define the domain of applicability is discussed in detail by Ellison et al.18 Here, the fragmentation method used by Ellison et al. was implemented (with slight alterations), along with an atom pair fragment approach. The domain was defined in the same manner for each fragmentation process: a compound was considered outside of the applicability domain if it produced fragments that were not present in the training data.
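As an illustration of this membership test, the sketch below uses RDKit's BRICS decomposition purely as a stand-in fragmenter; this is an assumption, since the study used its own algorithm together with Pipeline Pilot, not RDKit.

```python
# Sketch of the fragment-coverage domain test. RDKit's BRICS decomposition
# stands in here for the FragGen/atom pair procedures used in the study.
from rdkit import Chem
from rdkit.Chem import BRICS

def fragments(smiles: str) -> set[str]:
    """Return the set of fragment SMILES produced from one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return set(BRICS.BRICSDecompose(mol)) if mol else set()

def in_fragment_domain(query: str, training_fragments: set[str]) -> bool:
    """In-domain only if every fragment of the query also occurs in training."""
    return fragments(query) <= training_fragments

training_fragments: set[str] = set()
for smi in ("c1ccccc1N", "CCOC(=O)C"):      # toy two-compound training set
    training_fragments |= fragments(smi)
print(in_fragment_domain("CCOC(=O)CN", training_fragments))
```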

It was hoped that using two methods of fragmentation would highlight whether the types of fragments used to define domains of applicability have an impact on the resultant domain. For example, fragments which take into account every aspect of a molecule may produce a more strictly defined domain than more general fragments. The first fragmentation algorithm was developed by the authors, and the method is fully described by Ellison et al.18 However, it was implemented with slight alterations in this analysis. Previously, the algorithm split all possible combinations of one to six bonds in a molecule to produce all possible fragments. Doing so created an exponential number of fragments as the size of the molecules increased, imposing large time and computer memory costs on the algorithm. It became apparent that a large portion of the fragments created provided very little information about the structure of the molecule, as they were simply the result of splitting ring structures. To overcome this problem, a node in Pipeline Pilot23 was used in conjunction with the fragmentation algorithm to prevent the breaking of any bonds contained in a ring structure. A molecule was initially passed through the fragment generation node in Pipeline Pilot to create rings, ring assemblies, and chains. The chains then underwent further fragmentation using the Ellison et al. algorithm. The fragments created from the chains were then stored along with the rings and ring assemblies for further analysis. For the purposes of discussion in this paper, the combined use of the Ellison et al. algorithm and the Pipeline Pilot fragment generation node is termed the FragGen method. A second algorithm was used to fragment molecules into atom pair descriptors. The atom pairs generated were closely related to those originally proposed by Carhart et al.,24 encoding descriptions of a pair of atoms, including their element symbol, number of neighbors, and the number of π-bonds on each of the atoms, together with the path distance between them, measured as an atom count. The molecules were fragmented into atom pairs of all possible lengths up to a defined boundary. The algorithm for generating the atom pairs was written in the Java programming language, using the Eclipse environment.25 The molecular file loading and dissection tools used throughout the program were obtained from the chemoinformatics function library JoeLib.26 For both fragmentation algorithms, the training and test data were used in the same manner. The training data were fragmented, and all the fragments produced from those data were stored in a single file. The test compounds were fragmented individually so that the fragments for each test compound were stored in a separate file (i.e., one file of fragments per test molecule). The fragments from the training set were then searched, using a simple text matching algorithm written in Perl, to see whether they contained the test fragments. If a test compound produced fragments that were not contained in the training data, then that compound was considered to be outside of the domain of applicability.
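The atom pair encoding described above can be sketched as follows. This is a schematic Python reimplementation (the study's generator was written in Java with JoeLib), and details such as treating aromatic bonds as π-bonds are assumptions.

```python
# Schematic of Carhart-style atom pairs: (element, heavy-atom neighbours,
# pi-bond count) for each end atom, plus the bond-path distance between
# them. Counting aromatic bonds as pi-bonds is an assumption.
from itertools import combinations
from rdkit import Chem

PI_BONDS = (Chem.BondType.DOUBLE, Chem.BondType.TRIPLE, Chem.BondType.AROMATIC)

def describe(atom) -> tuple:
    pi = sum(1 for b in atom.GetBonds() if b.GetBondType() in PI_BONDS)
    return (atom.GetSymbol(), atom.GetDegree(), pi)

def atom_pairs(smiles: str, max_dist: int = 6) -> set[tuple]:
    mol = Chem.MolFromSmiles(smiles)
    dmat = Chem.GetDistanceMatrix(mol)       # topological distances
    pairs = set()
    for i, j in combinations(range(mol.GetNumAtoms()), 2):
        d = int(dmat[i][j])
        if d <= max_dist:
            a, b = sorted((describe(mol.GetAtomWithIdx(i)),
                           describe(mol.GetAtomWithIdx(j))))
            pairs.add((a, b, d))             # canonical order for matching
    return pairs

print(sorted(atom_pairs("CC=O"))[:3])
```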
Chemical Descriptor Ranges. Chemical descriptor ranges are commonly used to define the applicability domain for traditional QSAR models whose predictions are based on those descriptors.5 The range of the chemical descriptors is used to define the limits of the domain because interpolation within chemical space is considered to give more reliable results than extrapolation.16 Using descriptor ranges is possibly not the most appropriate method of defining the domain of structural alert models because the domain would be based on descriptors not used to build the model.


However, consideration of certain chemical descriptors that may affect how a molecule behaves in the test assay could be important:28 for example, the solubility issues related to compounds with a high octanol–water partition coefficient (log P). If a compound has, for example, a log P value that is outside the range of log P values of the training data, it is impossible to know how that compound would behave in the assay because it is outside the area of knowledge about the data. Therefore, a compound with a log P outside the range of the training data could be considered as outside the domain of the model. To examine whether chemical descriptor ranges would be useful in defining the applicability domain for structural alert models, the logarithm of the octanol–water partition coefficient (log P), the molecular weight (MW), and the energy of the lowest unoccupied molecular orbital (Elumo) were calculated for the training and test sets. These three descriptors were selected from the large range of chemical descriptors currently available because they are commonly used, are simple to calculate, and represent the three broad categories of chemical descriptors (hydrophobic, steric, and electronic). The log P and MW calculations were performed using KOWWIN in the Estimation Program Interface (EPI) Suite software (version 1.66), which is freely available from the United States Environmental Protection Agency.27 The Elumo calculations were performed in the Molecular Operating Environment (MOE), which is marketed by the Chemical Computing Group (www.chemcomp.com), using AM1 theory. The minimum and maximum descriptor values of the training data were calculated in Microsoft Excel and were used to define the limits of the domain. If the log P, MW, or Elumo value of a test compound was outside the range of the training data, the compound was considered to be outside of the applicability domain.
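The range test itself is simple. The sketch below assumes descriptor values computed externally (as in the study, with KOWWIN and MOE) and checks a query against the training min–max limits; the dictionary layout is illustrative.

```python
# Minimal sketch of the descriptor-range domain: a query is in-domain only
# if each descriptor falls inside the training min-max range. Descriptor
# values are assumed to come from external tools, as in the study.

def range_domain(training: list[dict]) -> dict:
    """Compute per-descriptor (min, max) limits from the training set."""
    keys = training[0].keys()
    return {k: (min(c[k] for c in training), max(c[k] for c in training))
            for k in keys}

def in_range_domain(query: dict, limits: dict) -> bool:
    return all(lo <= query[k] <= hi for k, (lo, hi) in limits.items())

limits = range_domain([{"logP": 1.2, "MW": 180.2, "Elumo": -0.5},
                       {"logP": 3.4, "MW": 251.1, "Elumo": 0.8}])
print(in_range_domain({"logP": 2.0, "MW": 200.0, "Elumo": 0.1}, limits))  # True
```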
Structural Similarity. Structural similarity is another method to define the applicability domain and is based on the concept that if a query chemical is "similar" to the chemicals in the training data, then the prediction will be reliable.29 However, similarity is a subjective term; i.e., chemical A cannot be similar to chemical B in absolute terms: chemical A can only be similar to B in terms of some measurable feature. This has led to the development of many different methods for calculating structural similarity. These methods are reviewed thoroughly by Nikolova and Jaworska.30 As no chemical descriptors are used in the building of structural alert models, the most appropriate measure with which to define the applicability domain could be structural similarity. If a query compound can be defined as similar to the compounds in the training set of a model, then that compound can be considered within the applicability domain of the model. This is based on the premise that chemicals that are structurally similar to training compounds can usually be expected to show similar toxicological behavior. The structural similarities of the test set compounds to those in the training set were measured using the Hellinger distance (atom environments, summary environments) calculated in Toxmatch.31 Toxmatch is freely downloadable software recently developed by Ideaconsult Ltd. under a Joint Research Centre (JRC) contract. The core functionalities of the software include the ability to compare data sets based on various structural and descriptor-based similarity indices, as well as the means to calculate pairwise similarity between compounds or the aggregated similarity of a compound to a set. For this analysis, if a test compound had a similarity value of 0.6 or higher to one or more of the training set compounds, then it was considered to be within the applicability domain.
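A rough sketch of this similarity test follows. Note the assumptions: RDKit Morgan fingerprints with Tanimoto similarity are used as a stand-in for Toxmatch's Hellinger distance over atom environments, so only the thresholding logic mirrors the study.

```python
# Sketch of the similarity-based domain test. Morgan/Tanimoto stands in
# for Toxmatch's Hellinger distance (an assumption); the 0.6 cutoff is
# the one used in the study.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles: str):
    return AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), 2, nBits=2048)

def in_similarity_domain(query: str, training: list[str],
                         cutoff: float = 0.6) -> bool:
    """In-domain if the query is similar (>= cutoff) to any training compound."""
    q = fp(query)
    return any(DataStructs.TanimotoSimilarity(q, fp(t)) >= cutoff
               for t in training)

print(in_similarity_domain("c1ccccc1O", ["c1ccccc1N", "CCO"]))
```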


The cutoff similarity value of 0.6 for compounds to be classed as similar to one another was used because it has been shown to be an appropriate cutoff elsewhere32,33 (i.e., compounds which have a similarity of 0.6 or more are likely to have similar activities).
AMBIT Discovery. AMBIT Discovery has been developed at the Bulgarian Academy of Sciences and is specifically designed to assess the applicability domains of (Q)SAR models. It is freely downloadable34 and estimates the applicability domain from a training set imported as an SDfile. If the training data are imported with descriptors, then the domain can be assessed using those descriptors, but the domain can also be defined, without imported descriptors, based on structural fingerprints (see Flower35 for further discussion of the use of fingerprints). For this analysis the training data were entered into AMBIT Discovery without any information regarding descriptors or models, and the fingerprint comparison (missing fragments) approach was used to define the domain. The test compounds were then entered, and the program assigned them as either within or outside of the domain based on analysis of their fingerprints. Both Toxmatch and AMBIT Discovery have been developed by the same organization, and both implement fingerprints to evaluate structural similarity. It is likely, therefore, that the fingerprints used in the two programs were generated in the same manner, and there may be some overlap in the information contained. However, although the same fingerprints could have been used by both programs, the use of those fingerprints to define the applicability domain was different. This should have resulted in individual domains being developed, and hence a difference between which compounds were found within or outside the applicability domain.
Comparison of the Domain Definition Methods. All ten sets of training and test data were separately run through the methods described above. The comparison of the four methods of domain definition was performed by examining the average percentage of true positives, false positives, true negatives, and false negatives from the test sets which were classified as either within or outside of the applicability domain for each of the methods. The hypothesis was that the most useful applicability domain would classify a large proportion of the false predictions as outside of the applicability domain and a large proportion of the true predictions as within it. For this purpose the average percentage of compounds classified as within and outside the applicability domain for each method was calculated. These data were then used to calculate the difference between the percentages of compounds with true and false positive predictions, and likewise between the percentages of compounds with true and false negative predictions:

difference = (percentage of true predictions within the domain) − (percentage of false predictions within the domain)

The greater the difference between the true and false predictions classified as within the domain, the more useful the defined domain is in terms of enabling the user to correctly assign confidence to the predictions made (i.e., to have high confidence in the correct predictions and low confidence in the incorrect predictions).
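This metric reduces to a few lines; in the minimal sketch below the flag lists are illustrative, with example values chosen to reproduce the Toxmatch positive-prediction figures from Table 3.

```python
# Sketch of the comparison metric defined above: percentage of true
# predictions inside the domain minus percentage of false predictions
# inside the domain (larger = more useful domain).
def pct_within(in_domain_flags: list[bool]) -> float:
    return 100.0 * sum(in_domain_flags) / len(in_domain_flags)

def difference(true_flags: list[bool], false_flags: list[bool]) -> float:
    return pct_within(true_flags) - pct_within(false_flags)

print(difference([True] * 93 + [False] * 7,
                 [True] * 69 + [False] * 31))   # 24.0 percentage points
```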

■ RESULTS AND DISCUSSION
The purpose of this study was to examine possible methods for defining the applicability domain of structural alert models. Doing so effectively would make possible the development of rules allowing compounds which do not fire any alerts and are found within the domain of the model to be predicted as potentially inactive. To this end, 4337 compounds with Ames test data21 were used to define and test several methods of domain definition for the mutagenicity structural alerts in Derek for Windows (DfW). The approaches of domain definition examined were fragments, chemical descriptor ranges, structural similarity, and the specific applicability domain definition software AMBIT Discovery.
Formation of Training and Test Sets. Predictions of plausible, probable, or certain from DfW were taken as positive predictions; doubted, improbable, impossible, and nothing to report were taken as negative predictions (five compounds received an "improbable" prediction; the rest of the negative predictions were inferred from the "nothing to report" outcome), and equivocal results were removed from the analysis (26 compounds were removed). The data were split on ten separate occasions to form ten sets of training and test data (referred to as Arrays 1–10). The test set in each array contained the same compounds with false negative (495 compounds) and false positive (385 compounds) predictions. The compounds with true predictions in the test set were different in each array, but their numbers remained the same (i.e., 226 compounds with true positive predictions and 140 compounds with true negative predictions). This equated to a training set of 3065 compounds and a test set of 1246 compounds in each array. Table 2 shows the number of those 1246 test compounds which were classified as within or outside of the domain of applicability as defined by each of the methods.


Table 2. Number of Test Set Compounds Classified As within or Outside of the Applicability Domain for Each of the Methods

Within domain:
array | FragGen | Atom Pairs | log P | MW   | Elumo | Toxmatch | AMBIT Discovery
1     | 801     | 1068       | 1245  | 1246 | 1244  | 943      | 1232
2     | 790     | 1076       | 1246  | 1246 | 1244  | 941      | 1224
3     | 815     | 1088       | 1245  | 1246 | 1244  | 945      | 1232
4     | 822     | 1087       | 1246  | 1246 | 1244  | 968      | 1231
5     | 822     | 1083       | 1246  | 1246 | 1244  | 961      | 1221
6     | 819     | 1094       | 1246  | 1246 | 1243  | 947      | 1232
7     | 810     | 1100       | 1245  | 1246 | 1244  | 963      | 1235
8     | 815     | 1079       | 1246  | 1246 | 1243  | 954      | 1236
9     | 808     | 1078       | 1246  | 1246 | 1244  | 979      | 1234
10    | 783     | 1066       | 1246  | 1246 | 1244  | 972      | 1233
mean  | 809     | 1082       | 1246  | 1246 | 1244  | 957      | 1231
SD    | 13.38   | 10.76      | 0.48  | 0.00 | 0.42  | 13.28    | 4.78

Outside domain:
array | FragGen | Atom Pairs | log P | MW   | Elumo | Toxmatch | AMBIT Discovery
1     | 445     | 178        | 1     | 0    | 2     | 303      | 14
2     | 456     | 170        | 0     | 0    | 2     | 305      | 22
3     | 431     | 158        | 1     | 0    | 2     | 301      | 14
4     | 424     | 159        | 0     | 0    | 2     | 278      | 15
5     | 424     | 163        | 0     | 0    | 2     | 285      | 25
6     | 427     | 152        | 0     | 0    | 3     | 299      | 14
7     | 436     | 146        | 1     | 0    | 2     | 283      | 11
8     | 431     | 167        | 0     | 0    | 3     | 292      | 10
9     | 438     | 168        | 0     | 0    | 2     | 267      | 12
10    | 463     | 180        | 0     | 0    | 2     | 274      | 13
mean  | 438     | 164        | 0     | 0    | 2     | 289      | 15
SD    | 13.38   | 10.76      | 0.48  | 0.00 | 0.42  | 13.28    | 4.78

Chemical Descriptor Ranges. It is clear from Table 2 that the chemical descriptor ranges do not provide any useful information regarding the limits of the applicability domain. For the log P and MW domains the average number of compounds classified as within the domain is 1246, which equates to the entire test set. However, in each of three arrays, one compound was classified as outside the applicability domain because of its log P value. The compounds assigned outside the domain are lycopene [502-65-8] (log P 17.64, outside the log P range of the training data in Array 1) and bleomycinic acid [37364-66-2] (log P −11.86, outside the log P range of the training data in Arrays 3 and 7). Although these values may be unrealistic calculated estimates of the partition coefficients of these compounds, the compounds represent the absolute extremes of the log P range found in these data. By chance they were selected to be part of the test data in the arrays in which they appear outside the domain. Both compounds have been shown to have no mutagenic effect in the Ames test, and no mutagenic activity was predicted by DfW. Therefore no useful classification of compounds as outside of the applicability domain occurred as a result of using the chemical descriptor ranges for log P and MW. The results for the Elumo domain appear slightly better than those for log P and MW. On average, two compounds were found to be consistently outside the applicability domain: propane-2-nitronate [20846-00-8] (Elumo −6.22 eV) and cyclopentane nitronate [29916-56-1] (Elumo −6.16 eV). These compounds are outside the domain in every array because they are false negative predictions and were therefore included in every test set. However, these structures are ionized, and hence the calculated Elumo values are not strictly comparable to those of unionized compounds.



Table 3. Mean Percentage of Test Set(a) Compounds with True and False Predictions That Were Classified As within the Domain of Applicability

                         positive prediction           negative prediction
method                 | true | false | difference | true | false | difference
Fragments – FragGen    | 84   | 68    | 16         | 66   | 54    | 12
Fragments – Atom Pairs | 96   | 82    | 14         | 90   | 86    | 4
Similarity – Toxmatch  | 93   | 69    | 24         | 78   | 75    | 3
AMBIT Discovery        | 99   | 99    | 0          | 99   | 98    | 1

(a) All of the compounds within the training set were implicitly within the domain; therefore, percentages of these compounds have not been reported.

In addition, Elumo calculations for ionized compounds may not reflect the experimental environment of the compounds. Also, as with the log P and MW ranges, correctly predicted compounds were found outside of the Elumo range when they were present in a test set: methanol [67-56-1] (Elumo 3.79 eV) received a true negative prediction but was found outside the domain in Arrays 6 and 8. Thus, it can be concluded that the range of Elumo values did not result in a useful classification of compounds outside of the applicability domain. The results produced when using the chemical descriptor ranges suggest that the training data were diverse and covered most of the chemical space, as described by these descriptors. It is possible that using descriptor ranges would be more useful for defining the limits of specific alerts, for example the reactivity cliffs of a congeneric series.36 However, there were too few data for each specific alert to investigate this in detail in this analysis. For the remaining approaches, it is important to examine not only the absolute numbers of compounds classified as within or outside of the applicability domain but also the prediction status of those compounds. Table 3 gives the mean percentage of true and false predictions from the test sets in all ten arrays that were classified as within the applicability domain. The usefulness of the remaining approaches was assessed on the hypothesis that the most useful domain would classify more compounds with false predictions outside of the applicability domain than compounds with true predictions. For this purpose the differences between the percentages of compounds with true and false positive predictions, and between the percentages of compounds with true and false negative predictions, were calculated (Table 3).
AMBIT Discovery. Under the conditions of domain testing described above, using AMBIT Discovery to define the applicability domain did not prove useful for these data. The program classified only a small number of compounds as outside of the applicability domain, and it was unable to differentiate between the true and false predictions. However, it must be noted that AMBIT Discovery has not been designed to perform under the conditions of this analysis. The software might be expected to perform much better if training data were entered into the program along with the descriptors and model to which the domain needs to be applied. Obviously that was not possible for this analysis because of the way structural alerts are developed. Thus it could be concluded that AMBIT Discovery is not an appropriate tool for defining the applicability domain of structural alerts.

Figure 2. The mean percentage of the compounds with true and false positive and negative predictions found within and outside the applicability domain when the domain is defined using (a) FragGen, (b) Atom Pairs, and (c) similarity as calculated with the Toxmatch software.

Similarity and Fragments. The use of chemical descriptor ranges and the AMBIT Discovery software have been shown to be of limited use for defining the applicability domain for these data. This leaves the two fragmentation methods and the similarity method applied through the Toxmatch software. For ease of viewing, the data in Table 3 are presented differently in Figure 2 for these three methods. Using the similarity measures calculated in Toxmatch produces the greatest difference between true and false positive predictions. However, the similarity method produces no real difference between the true and false negative predictions.



Figure 3. Example compound with fragments produced from FragGen and Atom Pairs. The atom pairs have been converted into SMARTS patterns for ease of viewing.

The effect of changing the cutoff similarity value used to assess whether a compound was within the domain was not investigated, as changing the cutoff would not affect the difference between the percentages of true and false predictions within the domain: it would either increase both the percentage of true and the percentage of false predictions within the domain or decrease both. The only method to differentiate the true negative and false negative predictions to any degree was FragGen. Interestingly, the two fragmentation approaches, FragGen and Atom Pairs, produced different results. When Atom Pairs were used to define the domain, the difference between the true and false positive predictions classified as within the domain was slightly smaller than when FragGen was used, and the Atom Pair domain resulted in a larger overall percentage of compounds classified as within the domain. Unlike FragGen, the Atom Pair method makes virtually no differentiation between the true and false negative predictions. A possible reason for these differences is that the fragments produced by FragGen define the domain in stricter terms, as the fragments themselves contain more information (Figure 3). For example, a SMILES string of a fragment produced by FragGen fully details all the atoms in the fragment, whereas the Atom Pair fragments only detail the terminal features. Therefore, there is an increased chance that the larger fragments produced by the FragGen process will not match the fragments from the training set. This causes a greater number of molecules to be classified as outside of the applicability domain when using the FragGen method.
Combining Methods. The two methods that best define the domain of applicability for these data are therefore structural similarity (calculated using Toxmatch) and chemical fragments (generated using FragGen). The similarity method forms a domain which produces a large difference between the numbers of compounds with true and false positive predictions classified as within the domain (24 percentage points), and the FragGen method is the only method able to produce any differentiation between compounds with true and false negative predictions (12 percentage points). Ideally, both of these characteristics need to be combined, and a more useful applicability domain could therefore perhaps be defined by combining the methods. There are two possible ways that two methods of defining the applicability domain could be combined: (i) both methods would need to classify a compound as within the domain for the compound to be considered as within the applicability domain, or (ii) both methods would need to classify a compound as outside the domain for the compound to be considered as outside the applicability domain (Table 4).
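As a minimal sketch (not code from the study), the two combination modes of Table 4 reduce to a conjunction and a disjunction of the individual domain classifications:

```python
# Sketch of the two combination modes in Table 4: combination (i) keeps a
# compound in-domain only if both methods agree "in"; combination (ii)
# excludes it only if both methods agree "out".
def combined_in_domain(in_fraggen: bool, in_similarity: bool, mode: str) -> bool:
    if mode == "i":                 # strict (precautionary) combination
        return in_fraggen and in_similarity
    if mode == "ii":                # lenient combination
        return in_fraggen or in_similarity
    raise ValueError(f"unknown mode: {mode}")

print(combined_in_domain(True, False, "i"))    # False: excluded under (i)
print(combined_in_domain(True, False, "ii"))   # True: retained under (ii)
```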


Table 4. Possible Overall Outcomes When the Domain Classifications from FragGen and Similarity Are Combined

similarity classification | FragGen: in domain                                        | FragGen: out domain
in domain                 | combination (i): in domain; combination (ii): in domain   | combination (i): out domain; combination (ii): in domain
out domain                | combination (i): out domain; combination (ii): in domain | combination (i): out domain; combination (ii): out domain

Table 5. Average Number of Compounds Classified As Outside of the Applicability Domain When It Is Defined Using the Similarity Measures Calculated by Toxmatch, the Fragments Produced by FragGen, and the Number of Compounds Which Are Commonly Classified As Outside by Both

             positive prediction                     negative prediction
             true            false                   true            false
method     | mean | SD     | mean | SD            | mean | SD     | mean | SD
FragGen    | 37   | 7.38   | 124  | 11.22         | 48   | 8.04   | 229  | 1.78
Similarity | 17   | 3.20   | 118  | 4.88          | 31   | 6.16   | 123  | 5.38
in common  | 8    | 2.08   | 51   | 5.86          | 13   | 5.51   | 79   | 8.54

Either combination is likely to have little effect on the usefulness of the defined domain if the two methods have a large proportion of the compounds they classify as outside of the domain in common. The data shown in Table 5 demonstrate that this is not the case: the average number of compounds classified outside the applicability domain by both methods in common is much lower than the number classified outside by either method alone. Figure 4 displays the outcome when the two methods are combined in either fashion. Combining structural similarity and the use of fragments by requiring both methods to classify a compound as within the domain (combination (i)) gives a slight increase in the difference between the true and false positive predictions (30 percentage points) compared with similarity alone (Figure 4a). This combination also results in a difference between the true and false negatives (8 percentage points) that does not appear when using similarity alone. The latter difference is not as great as that achieved when using FragGen alone to define the domain, but overall the combined method performs better than FragGen. Combination (i) can be considered the better of the two combinations if the precautionary principle is taken into consideration: it is better to have had low confidence in what turn out to be correct predictions than to have had high confidence in predictions that turn out to be wrong. Combination (i) performed better than combination (ii) because relatively few compounds are commonly classified outside the domain by both FragGen and structural similarity (Table 5). Combination (i) also shows that the percentage of compounds commonly classified as within the domain differs between the true and false positive predictions; i.e., more true positive predictions than false positive predictions are commonly classified as within the domain. Thus, when the methods concurrently classify a compound as within the applicability domain, the prediction is more likely to be true than false. This has resulted in the large difference between the percentages of true and false positive predictions classified as within the domain which is visible in Figure 4a. Therefore it can be concluded that a useful approach to defining the applicability domain for structural alerts may be to combine the use of both fragments and structural similarity. However, the combination of domain definition techniques (similarity using Toxmatch and fragments from FragGen) is more effective for positive predictions than for negative predictions. This means that if a compound is predicted to be active by DfW but is dissimilar to the training data (in terms of both global similarity and the presence of specific fragments), the compound is likely to be experimentally inactive. However, if a compound is not predicted to be active by DfW, dissimilarity between the compound and the training set has little effect on the probability of that compound being active or inactive experimentally. This suggests that when a compound contains a structural alert and is predicted active, it also needs to be structurally similar to the training compounds for that prediction to be reliable. A potential explanation is that a compound containing a structural alert but dissimilar to the training compounds might have a structural feature which inhibits the activity of the structural alert. This mechanism would not apply in every case, which may explain why approximately 20% of the compounds with true positive predictions (Figure 4a) were classified as outside the applicability domain because of structural dissimilarity. However, examining the negative predictions suggests that structures identified as dissimilar are not, in general, active. Approximately 50% of the compounds with true negative predictions are structurally dissimilar, but so are approximately 50% of the compounds with false negative predictions (Figure 4a). It appears from these data that altering the structure of a query compound so that it is dissimilar from the training data is far more likely to remove activity than to create it. Perhaps this suggests that most of the structures associated with causing mutagenicity are already known and are therefore present in the training data.
The Use of "Nothing to Report" As a Prediction of Inactivity. One of the overall advantages of successfully defining the applicability domains of structural alerts would be the possibility of attaching some level of confidence to a prediction of inactivity made on the basis of the absence of any alert. For this, it would have to be shown that compounds which do not contain any alerts and are experimentally inactive are found within the domain, while experimentally active ones are found outside of it. The results of this particular study have shown this not to be the case and therefore do not corroborate the argument that the absence of structural alerts implies inactivity. However, for the 1936 nonmutagenic compounds in the Ames data set, DfW had "nothing to report" for 74% of them, whereas for the 2401 mutagenic compounds only 20% received the "nothing to report" output. The high proportion of inactive compounds which receive a "nothing to report" outcome could be just coincidence. The small proportion of active compounds which receive "nothing to report" highlights the areas of reactive space not currently covered by the alerts. However, as the knowledge in DfW increases and more of the mechanistic basis of mutagenicity is discovered and converted into alerts, the percentages stated above should be expected to improve.

Figure 4. The mean percentage of the compounds with true and false positive and negative predictions found within and outside the applicability domain when the domain is defined using similarity and FragGen combined: (a) both methods need to classify a compound as within the applicability domain for it to be considered in, and (b) both methods need to classify a compound as outside the applicability domain for it to be considered out.

The difficulty arises when trying to differentiate between the majority of compounds which receive a "nothing to report" outcome and are not active and the minority which are active because of unknown alerts. It was hoped that an applicability domain able to describe the chemical space of the compounds "known" to the model would be able to exclude the false negative predictions (i.e., mutagens wrongly assumed to be inactive because they received a "nothing to report" outcome); the feature of the molecule causing activity might be expected to be "unknown" to the model and therefore a valid criterion for exclusion from the applicability domain. However, the results show that the applicability domains developed in this paper were not able to make this differentiation.

■ CONCLUSIONS
The idea of using multiple methods to define applicability domains is present throughout the (Q)SAR literature.5,8,11,16,37


It is therefore of little surprise that using multiple methods has proved to be the most successful approach for structural alerts. Incorporating multiple domain definition methods allows different factors to have a role in defining the limits of the domain. Using multiple techniques leads to a more thoroughly defined domain, and it has been shown that compounds found within such domains are more likely to be correctly predicted than those found outside them.11 In this case, incorporating information on whether a compound is "globally" similar to the training set (similarity using Toxmatch) along with specific information about any unusual features the compound may contain (fragments generated by FragGen) proved the most useful. However, going further and adding other methods (e.g., descriptor ranges) does not appear to add anything to the resultant domain, at least for the compounds used in this study. This highlights the need to assess which of the many possible techniques for defining the domain are appropriate for any given model. This study has shown that, through the use of structural similarity and chemical fragments, the applicability domain can be defined in a useful manner, especially for the positive predictions (albeit with only a modest discrimination between the correct and incorrect predictions). That is to say, compounds found within the applicability domain are more likely to be correctly predicted than those found outside of the domain. However, this was not the case for compounds which were classified as inactive. It was hoped that it would be possible to define an applicability domain that would exclude compounds acting via an unknown mechanism, i.e., compounds that were active experimentally but did not fire any alerts. However, compounds which were experimentally active and did not contain any alerts were equally likely to be found within or outside the structural domain. Therefore, it has to be concluded that the applicability domain definition techniques as described and applied in this study will not aid in the development of an "if no alerts are present then activity is improbable" type of rule. Further work on different end points and different applicability domain techniques is required to seek solutions to the problems encountered here when defining the structural applicability domain.

■ AUTHOR INFORMATION
Corresponding Author
*Fax: 0151 231 2170. E-mail: [email protected].

■ ACKNOWLEDGMENT
Claire Ellison and Richard Sherhod acknowledge the funding of Lhasa Limited, Leeds, England. The funding of the European Union Sixth Framework CAESAR Specific Targeted Project (SSPI-022674-CAESAR) and the European Chemicals Agency (EChA) Service Contract No. ECHA/2008/20/ECA.203 is also gratefully acknowledged. The atom pair descriptors were generated by Richard Sherhod using JoeLib, and we gratefully acknowledge the provision of this software by the University of Tübingen, Germany.

■ REFERENCES
(1) Hulzebos, E.; Walker, J. D.; Gerner, I.; Schlegel, K. Use of structural alerts to develop rules for identifying chemical substances with skin irritation or skin corrosion potential. QSAR Comb. Sci. 2005, 24, 332–342.
(2) Schultz, T. W.; Yarbrough, J. W.; Hunter, R. S.; Aptula, A. O. Verification of the structural alerts for Michael acceptors. Chem. Res. Toxicol. 2007, 20, 1359–1363.
(3) von der Ohe, P. C.; Kühne, R.; Ebert, R.; Altenburger, R.; Liess, M.; Schüürmann, G. Structural alerts, a new classification model to discriminate excess toxicity from narcotic effect levels of organic compounds in the acute daphnid assay. Chem. Res. Toxicol. 2005, 18, 536–555.
(4) Hulzebos, E. M.; Posthumus, R. (Q)SARs: Gatekeepers against risk on chemicals? SAR QSAR Environ. Res. 2003, 14, 285–316.
(5) Netzeva, T. I.; Worth, A. P.; Aldenberg, T.; Benigni, R.; Cronin, M. T. D.; Gramatica, P.; Jaworska, J. S.; Kahn, S.; Klopman, G.; Marchant, C. A.; Myatt, G.; Nikolova-Jeliazkova, N.; Patlewicz, G. Y.; Perkins, R.; Roberts, D. W.; Schultz, T. W.; Stanton, D. T.; van de Sandt, J. J. M.; Tong, W. D.; Veith, G.; Yang, C. H. Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships - The report and recommendations of ECVAM Workshop 52. ATLA, Altern. Lab. Anim. 2005, 33, 155–173.
(6) Gramatica, P. Principles of QSAR models validation: internal and external. QSAR Comb. Sci. 2007, 26, 694–701.
(7) Hulzebos, E.; Sijm, D.; Traas, T.; Posthumus, R.; Maslankiewicz, L. Validity and validation of expert (Q)SAR systems. SAR QSAR Environ. Res. 2005, 16, 385–401.
(8) Nikolova-Jeliazkova, N.; Jaworska, J. An approach to determining applicability domains for QSAR group contribution models: An analysis of SRC KOWWIN. ATLA, Altern. Lab. Anim. 2005, 33, 461–470.
(9) Schultz, T. W.; Hewitt, M.; Netzeva, T. I.; Cronin, M. T. D. Assessing applicability domains of toxicological QSARs: Definition, confidence in predicted values, and the role of mechanisms of action. QSAR Comb. Sci. 2007, 26, 238–254.
(10) Worth, A. P.; Hartung, T.; van Leeuwen, C. J. The role of the European Centre for the Validation of Alternative Methods (ECVAM) in the validation of (Q)SARs. SAR QSAR Environ. Res. 2004, 15, 345–358.
(11) Dimitrov, S.; Dimitrova, G.; Pavlov, T.; Dimitrova, N.; Patlewicz, G.; Niemela, J.; Mekenyan, O. A stepwise approach for defining the applicability domain of SAR and QSAR models. J. Chem. Inf. Model. 2005, 45, 839–849.
(12) Gerner, I.; Barratt, M. D.; Zinke, S.; Schlegel, K.; Schlede, E. Development and prevalidation of a list of structure-toxicity relationship rules to be used in expert systems for prediction of the skin-sensitising properties of chemicals. ATLA, Altern. Lab. Anim. 2004, 32, 487–509.
(13) Maslankiewicz, L.; Hulzebos, E.; Vermeire, T. G.; Muller, J. J. A.; Piersma, A. H. Can chemical structure predict reproductive toxicity?; RIVM report 601200005/2005; 2005. Published online: http://rivm.openrepository.com/rivm/bitstream/10029/7374/1/601200005.pdf (accessed January 14, 2011).
(14) Netzeva, T. I.; Schultz, T. W. QSARs for the aquatic toxicity of aromatic aldehydes from Tetrahymena data. Chemosphere 2005, 61, 1632–1643.
(15) Pearl, G. M.; Livingstone-Carr, S.; Durham, S. K. Integration of computational analysis as a sentinel tool in toxicological assessments. Curr. Top. Med. Chem. 2001, 1, 247–255.
(16) Jaworska, J.; Nikolova-Jeliazkova, N.; Aldenberg, T. QSAR applicability domain estimation by projection of the training set in descriptor space: A review. ATLA, Altern. Lab. Anim. 2005, 33, 445–459.
(17) Dragos, H.; Gilles, M.; Alexandre, V. Predicting the predictability: A unified approach to the applicability domain problem of QSAR models. J. Chem. Inf. Model. 2009, 49, 1762–1776.
(18) Ellison, C. M.; Enoch, S. J.; Cronin, M. T. D.; Madden, J. C.; Judson, P. A structural fragment based approach to define the applicability domain of knowledge based predictive toxicology expert systems. ATLA, Altern. Lab. Anim. 2009, 37, 533–547.
(19) Patlewicz, G.; Aptula, A. O.; Uriarte, E.; Roberts, D. W.; Kern, P. S.; Gerberick, G. F.; Kimber, I.; Dearman, R. J.; Ryan, C. A.; Basketter, D. A. An evaluation of selected global (Q)SARs/expert


systems for the prediction of skin sensitisation potential. SAR QSAR Environ. Res. 2007, 18, 515–541.
(20) Cariello, N. F.; Wilson, J. D.; Britt, B. H.; Wedd, D. J.; Burlinson, B.; Gombar, V. Comparison of the computer programs DEREK and TOPKAT to predict bacterial mutagenicity. Mutagenesis 2002, 17, 321–329.
(21) Kazius, J.; McGuire, R.; Bursi, R. Derivation and validation of toxicophores for mutagenicity prediction. J. Med. Chem. 2005, 48, 312–320.
(22) Eriksson, L.; Jaworska, J.; Worth, A. P.; Cronin, M. T. D.; McDowell, R. M.; Gramatica, P. Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environ. Health Perspect. 2003, 111, 1361–1375.
(23) Pipeline Pilot, version 7.5; Accelrys: San Diego, CA, 2008. http://accelrys.com/products/scitegic/ (accessed March 30, 2009).
(24) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom pairs as molecular features in structure-activity studies: definition and applications. J. Chem. Inf. Comput. Sci. 1985, 25, 64–73.
(25) Eclipse; The Eclipse Foundation: Ottawa, Canada, 2010. http://www.eclipse.org/ (accessed February 12, 2010).
(26) JoeLib; The University of Tübingen: Tübingen, Germany, 2008. http://www.ra.cs.uni-tuebingen.de/software/joelib/introduction.html (accessed February 12, 2010).
(27) EpiSuite; U.S. Environmental Protection Agency: Syracuse, NY, 2009. http://www.epa.gov/oppt/exposure/pubs/episuite.htm (accessed July 5, 2010).
(28) Mackay, D.; Arnot, J. A.; Petkova, E. P.; Wallace, K. B.; Call, D. J.; Brooke, L. T.; Veith, G. D. The physicochemical basis of QSARs for baseline toxicity. SAR QSAR Environ. Res. 2009, 20, 393–414.
(29) Kühne, R.; Ebert, R.; Schüürmann, G. Model selection based on structural similarity-method description and application to water solubility prediction. J. Chem. Inf. Model. 2006, 46, 636–641.
(30) Nikolova, N.; Jaworska, J. Approaches to measure chemical similarity - a review. QSAR Comb. Sci. 2003, 22, 1006–1026.
(31) Toxmatch; Ideaconsult Ltd.: Sofia, Bulgaria, 2008. http://ecb.jrc.it/qsar/qsar-tools/index.php?c=TOXMATCH (accessed August 28, 2008).
(32) Enoch, S. J.; Cronin, M. T. D.; Madden, J. C.; Hewitt, M. Formation of structural categories to allow for read-across for teratogenicity. QSAR Comb. Sci. 2009, 28, 696–708.
(33) Hewitt, M.; Ellison, C. M.; Enoch, S. J.; Madden, J. C.; Cronin, M. T. D. Integrating (Q)SAR models, expert systems and read-across approaches for the prediction of developmental toxicity. Reprod. Toxicol. 2010, 30, 147–160.
(34) AMBIT Discovery; Ideaconsult Ltd.: Sofia, Bulgaria, 2008. http://ambit.acad.bg/ (accessed August 24, 2009).
(35) Flower, D. R. On the properties of bit string-based measures of chemical similarity. J. Chem. Inf. Comput. Sci. 1998, 38, 379–386.
(36) Maggiora, G. M. On outliers and activity cliffs - why QSAR often disappoints. J. Chem. Inf. Model. 2006, 46, 1535.
(37) EChA. Guidance on information requirements and chemical safety assessment. Chapter R.7a. Endpoint specific guidance. 2008. Available at http://guidance.echa.europa.eu/docs/guidance_document/information_requirements_r7a_en.pdf?vers=20_08_08 (accessed March 15, 2009).
