Evaluation of Different Methods for Identification of ... - ACS Publications

May 9, 2017 - resulted in redundancy so that their predictive performance was not as good .... python library for the derivation of structural alerts ...
0 downloads 0 Views 2MB Size
This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.

Article pubs.acs.org/crt

Evaluation of Different Methods for Identification of Structural Alerts Using Chemical Ames Mutagenicity Data Set as a Benchmark Hongbin Yang, Jie Li, Zengrui Wu, Weihua Li, Guixia Liu, and Yun Tang* Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China S Supporting Information *

ABSTRACT: Identification of structural alerts for toxicity is useful in drug discovery and other fields such as environmental protection. With structural alerts, researchers can quickly identify potential toxic compounds and learn how to modify them. Hence, it is important to determine structural alerts from a large number of compounds quickly and accurately. There are already many methods reported for identification of structural alerts. However, how to evaluate those methods is a problem. In this paper, we tried to evaluate four of the methods for monosubstructure identification with three indices including accuracy rate, coverage rate, and information gain to compare their advantages and disadvantages. The Kazius’ Ames mutagenicity data set was used as the benchmark, and the four methods were MoSS (graph-based), SARpy (fragment-based), and two fingerprint-based methods including Bioalerts and the fingerprint (FP) method we previously used. The results showed that Bioalerts and FP could detect key substructures with high accuracy and coverage rates because they allowed unclosed rings and wildcard atom or bond types. However, they also resulted in redundancy so that their predictive performance was not as good as that of SARpy. SARpy was competitive in predictive performance in both training set and external validation set. These results might be helpful for users to select appropriate methods and further development of methods for identification of structural alerts.



substructures for alerts.9 Although only a few substructures were identified from a small data set by Ashby, it was widely accepted and developed.10,11 Afterward, several toxicity end points and structural types were intensively explored by experts, and their limitations such as applicability domain were also studied.12,13 Therefore, detection of potential substructures for alert is highly desirable, which could quickly identify toxic compounds from a large data set and guide drug design. Initially, experts identified structural alerts through exploring structure−activity relationships (SARs) of toxic compounds14 to create an expert system. Derek Nexus15 and Genetox Expert Alerts from LeadScope16 are two representatives of commercial expert systems to predict toxicity. After that, automated detection of structural alerts has become a hotspot in computational toxicology, and there are several excellent methods and toolkits available.17,18 In general, methodologies implemented to determine structural alerts may be roughly classified into fragmentbased, graph-based, and fingerprint-based approaches (Figure 1). The fragment-based approach first cuts the bonds of each compound to get all possible fragments and then calculates the frequency of the fragments occurring in toxic and nontoxic

INTRODUCTION Chemical toxicity is an important issue in drug discovery and environmental risk assessment. In silico prediction of chemical toxicity is less expensive and time-saving compared to experimental in vitro or in vivo assays.1 Although machine learning methods, such as support vector machine (SVM) and random forest (RF), have obtained high predictive accuracy, they are considered as black boxes. Interpretable models are more significant, friendly, and acceptable for researchers in medicinal chemistry.2 Structural alerts (SA), like an expert system, can help identify key substructures to certain toxicity and avoid potential toxic compounds in very early stages. It has been widely applied not only in drug discovery,3,4 but also in other fields such as cosmetic research and environmental protection.5 Structural alerts are defined as chemical substructures, whose presence may be related to the capability of a substance to cause certain adverse effects to organs.6 Chemical mutagenicity is one of the most widely studied end points for structural alerts. It is usually determined via the famous Ames test, named after the inventor Prof. Bruce Ames, to detect potential mutagens.7 The qualitative reproducibility of the Ames test among laboratories is generally good.8 As early as in 1988, Ashby found strong associations between chemical structures and their mutagenicity to Salmonella and suggested 11 © 2017 American Chemical Society

Received: March 27, 2017 Published: May 9, 2017 1355

DOI: 10.1021/acs.chemrestox.7b00083 Chem. Res. Toxicol. 2017, 30, 1355−1364

Article

Chemical Research in Toxicology

Figure 1. Three categories of approaches implemented to determine structural alerts. (A) Fragment-based methods that cut all possible bonds to obtain substructures. (B) Graph-based methods that search subgraphs by a levelwise algorithm. (C) Fingerprint-based methods that regard the predefined substructures as potential structural alerts.

compound sets. For example, CASE19 uses KLN20 code, a linear expression for a compound, as descriptors to detect SAs. CatSAR21 uses HQSAR, a module implemented in the commercial software SYBYL from Tripos Inc., to generate all possible substructures. Ferrari et al. developed a software named SARpy that can extract structural alerts from data set by fragmentation of all the compounds.22 Golbamaki et al. explored new clues on carcinogenicity-related substructures via SARpy.23,24 The graph-based approach defines chemical molecules as mathematic graphs that consist of a set of vertices and edges that respectively represent the atoms and bonds of compounds. Then the goal of detecting substructures is converted into searching subgraphs.25 MolFea,26 a linear substructure mining program, utilizes the Levelwise Version Space algorithm27 to handle minimum and maximum frequency constraints. Nevertheless, the shortcoming of MolFea is obvious because it only mines linear substructures. This weakness was overcome by a popular algorithm named MoSS,28 which utilizes depth-firstsearch association rules to mine substructures that considers side chains and cycles. Gaston29 is another graph-based mining algorithm, which divides substructures into three subsets, namely paths, trees, and cycles. It allows user-defined node types like [N,O] that indicates nitrogen or oxygen. Ahlberg et al.30 proposed a new graph-based mining method that uses Atom Signature as descriptor. Atom Signature, suggested by Faulon et al.,31 is a linear expression of a compound and can be calculated via Chemical Development Kit.32 Molecular fingerprints can be viewed as a set of predefined fragments.33 For example, the MDL MACCS (Molecular ACCess System)34 fingerprint contains 166 public keys that represent the most common substructures. Thus, they can be used to detect potential structural alerts via calculation of their frequencies in a data set. All these methodologies have their advantages and defects in computing performance, predictive accuracy, or chemical space. They have been widely used in toxicological research and drug discovery.35−38 However, to the best of our knowledge, there is

no study to compare and evaluate these identification methods with a benchmark data set yet. In this paper, we compared four typical methods among the three types of approaches, namely MoSS (graph-based), SARpy (fragment-based), Bioalerts, and FP (both fingerprint-based), using a chemical Ames mutagenicity data set as the benchmark. We evaluated them from three aspects: (1) to assess the substructures identified by each method and analyze their rankings; (2) to compare the substructures with previous findings; and (3) to package the substructure sets detected by each method as rule-based prediction models to evaluate their predictive performance. Then we deeply analyzed the process of the detection systems and summarized their defects, which might be helpful for users to select appropriate methods and further development of methods for identification of structural alerts.



MATERIALS AND METHODS

Data Preparation. The most popular data set for chemical Ames mutagenicity is Hansen’s benchmark that contains 6512 compounds.39 Kazius’ data set,40 a subset of Hansen’s benchmark, is also a commonly used one. To ensure the compatibility, the Kazius’ data set was used as the benchmark here to compare different mining methods. The preparation of the data set was done by Kazius.41 To validate the robustness of the methods, the rest of the Hansen’s data set (after removing the compounds in Kazius’ data set) was used as the external validation set to further evaluate the mining methods. OpenBabel42 was used to convert all the compounds into canonical SMILES format, which can be directly used to mine structural alerts. Methods for Identification of Structural Alerts. Four detecting methods were used in this study, namely MoSS, SARpy, Bioalerts, and FP (fingerprint). Mostly the substructures were described as SMILES, graphs, or fragments that only contain basic information on atoms and bonds. In FP, the substructures were described as SMARTS, an extension of SMILES, which contains more information about the atoms and bonds and allows use wildcard atom or bond types. SMARTS-based structural alerts are more diverse but more complicated. MoSS. MoSS is a graph-based method for structural alerts that generates fragments by embedding them in all molecules in parallel throughout the growth process.28 This search strategy allows for a restricted depth first search algorithm, which results in excellent 1356

DOI: 10.1021/acs.chemrestox.7b00083 Chem. Res. Toxicol. 2017, 30, 1355−1364

Article

Chemical Research in Toxicology

Figure 2. (A) Ideal substructure that separates the compounds into two subsets with lower entropy. The red part indicates the number of active (toxic) compounds, whereas the green part represents the inactive compounds; (B) function curve of entropy with variable of p that means the proportion of active (or inactive) compounds; (C) less desirable substructure in which though the active rate in positive set is high, the corresponding compounds are too rare, and the overall entropy changes little, that is, the IG is low. computing performance and theoretically covers all chemical space. MoSS was implemented in KNIME (V2.10.4).43 We set the fragment size between 2 and 15, the minimum focus support (true positive rate) as 2%, the maximum complement support (false positive rate) as 10%, and other parameters were set as default. SARpy. SARpy (https://sourceforge.net/projects/sarpy/) is a free standalone tool that uses “String mining” to create substructures from the SMILES notation of training compounds.22,23 Different from other methods, only entire branches or entire cycles are considered as potential structural alerts. SARpy evaluates existing substructures by likelihood ratio, a measure of precision intrinsic to the test. In the implementation of SARpy, the atom number was set between 2 and 15, the precision was set to OPTIMAL, and only positive rules (mutagens) were extracted. Bioalerts. Bioalerts (https://github.com/isidroc/bioalerts) is a python library for the derivation of structural alerts from toxicity data sets. It is a fingerprint-based method, which mainly relies on RDkit44 implementation of Morgan fingerprints.45 P-value is used in Bioalerts to evaluate the significance of a substructure. In Bioalerts, threshold_frequency was set to 0.75, and other parameters were set to default (p-value ≤ 0.05, nb ≥ 5). The structural alerts mined by Bioalerts were converted into SMARTS by RDKit. FP. The FP method was composed of three steps: (1) calculation of fingerprint; (2) removal of redundant substructures; and (3) assessment of potential structural alerts and removing redundant. PaDEL-Descriptor (http://yapcwsoft.com/dd/padeldescriptor/, V2.20)46 was used to calculate the fingerprints of MACCS (166 bits), PubChem (889 bits),47 Klekota-Roth (4860 bits),48 and Function Group Substructure (307 bits, classified by Christian Laggner and defined in OpenBabel). After the fingerprints of training compounds were calculated, we could get a matrix that shows the relationships between the compounds and the substructures defined in the fingerprint dictionary. Direct deletion of the duplicated substructures is impractical since we could not automatically determine whether two substructures represented by SMARTS are equivalent. Therefore, we first verified whether the sets of compounds

matching the two substructures were the same, then manually judged whether they were the same substructure. Subtle differences between two substructures were ignored, and we regarded them as the same to avoid redundancy. For example, MACCS63 - “[#7]=[#8]”, KR2380 “[!#1]NO”, and KR4117 - “NO” all indicate double bond between a nitrogen and an oxygen. Though the meanings of the three SMARTS were different, they were considered to be the same, and the latter two were removed. We assessed and refined them through a threshold of ACC (accuracy rate) > 0.75. Evaluation of Substructures and Predictive Models. Indices of Substructures for Assessment. Supposing that molecules can be categorized as mutagen or nonmutagen based on a binary property X, X will have two possible values (1 means mutagen and 0 means nonmutagen). For a potential structural alert (substructure identified by the four methods) T, we used condition t to define compounds that contain the substructure and used t ̅ to define compounds that do not contain the substructure. ACC is the rate between the number of mutagens that contain a certain substructure and the number of all compounds that contain the substructure (eq 1). Positive rate, also called coverage rate, is the probability of compounds containing the substructure (eq 2): ACC (accuracy rate) = P(X = 1 t )

(1)

PR (positive rate) = P(t )

(2)

where P(A) means probability of observing A. P(A|B) is the probability of observing event A given the condition B. Besides the assessment methods mentioned earlier, another more complicated method called “information gain (IG)” was also used to assess a substructure. It was used in decision tree classification models to measure the relevance of an attribute to a category in data mining.49 Shen et al. used this method to evaluate the importance of the substructures in MACCS fingerprint and labeled the highest substructures as structural alerts.33 IG is defined as the difference between the information entropy of the origin data set and the weighted average information entropies of the two data sets separated 1357

DOI: 10.1021/acs.chemrestox.7b00083 Chem. Res. Toxicol. 2017, 30, 1355−1364

Article

Chemical Research in Toxicology by a substructure.50 Eq 3 shows the definition of information entropy, in which p means the probability of molecules in one category. Eq 4 shows the formation of IG used in this study. An ideal substructure whose IG is very high can separate the data set into two: a high positive rate one and a high negative rate one (Figure 2a), in which the two separated data sets have a very low information entropy compared to the original data set. Information entropy as a function depends on the positive rate of a data set. When the data set is the most disordered, that is, positive rate equals 0.5, the entropy equals 1. When the data set is the most regular, that is, positive rate equals 0 or 1, the entropy equals 0. The function tendency is shown in Figure 2b. A substructure with high IG should not only have high ACC, but also be common enough (high PR). A substructure that has low PR with high ACC (Figure 2c) will get a low IG score:

E(p) = − p × log(p) − (1 − p) × log(1 − p)

On the basis of the Kazius’ data set, 129 frequent substructures were obtained from MoSS, while 130 potential structural alerts were found from SARpy. Bioalerts detected 1094 patterns, from which we selected 395 unique substructures whose IG values were higher than 0.002. In FP method, the fingerprints output 6214 bits in total. Then we retained the fragments whose accuracy is higher than 0.75 and discarded the nonsubstructure patterns (e.g., MACCS49 is “[! +0]”). Finally, 173 patterns were left with a manual screening because many patterns were similar (Table S1). Distribution of Substructures with Three Indices. ACC, PR, and IG are the three most important indices that could decide the capability of the substructures to predict the toxicity of a compound. These three indices were hence used to evaluate the performance of the four methods. After calculations, the three indices of each substructure were plotted in Figure 3. Although both the ACC and PR values of a

(3)

IG = E(P(X = 1)) − P(t ) × E(P(X = 1|t )) − P(t ) × E(P(x = 0|t ))

(4)

We calculated the three indices, IG, ACC, and PR, of all the potential structural alerts detected by the mentioned methods. OpenBabel and its python wrapper Pybel51 were used to match SMARTS with compounds and verify equality of two compounds. Performance Evaluation of Predictive Models. For each predictive model, a sorted list of its structural alerts was generated. The structural alerts were sorted by either ACC or IG. Two indicators were calculated to show the accuracy and robustness of the models including positive coverage rate and negative coverage rate described further:

PCR(L) = P(X = 1|CL)

(5)

NCR(L) = P(X = 0| CL)

(6)

where CL means the compounds that contain at least one substructure in the top L places of the potential structural alert list. CL means the compounds that contain none of the top L structural alerts. Comparison of Methods with Previous Work. The substructures identified by the four methods were then compared with the structural alerts reported in previous publications. ToxAlerts52 collected several structural alerts for genotoxic carcinogenicity and mutagenicity from publications including Kazius et al.,40 Benigni et al.,53 Bailey et al.,54 and Ashby’s.7 We then collected these structural alerts and manually discarded the redundant patterns. If two alerts were similar, we would select the more general one. For example, there were three “aromatic nitro” alerts in ToxAlerts. Two of them were detected by Kazius, and the other was defined by Benigni R et al.53 We selected the most general pattern, which is “[a!r0][$([NX3+](=[OX1])([O-]))]”. In addition, previously Kazius detected structural alerts with Gaston from the same data set as we used.41 Therefore, we compared the results from Kazius with all the substructures identified by the four methods here and ranked these substructures with IG values.

Figure 3. Distribution of the substructures with their ACC and PR and colored by IG values.

substructure were expected to be close to 1, it was nearly impossible to get a substructure of which ACC and PR values were both very high. When the ACC value became close to 1, the maximum of PR value was simultaneously decreased. The border of PR (regarding ACC as independent variable) formed a curved virtual boundary. The more close to the virtual boundary a substructure was, the higher IG value it had. The highest IG value was 0.083 and the corresponding ACC = 0.809, PR = 0.306. As the ACC value became higher or lower (while the PR value became lower or higher), the IG value decreased. Figure 4 shows the distribution of the substructures detected by each method. The FP method occupied the highest average IG value, 0.0213, and the highest average PR value. The ACC values of most of its substructures distributed between 0.8−0.9. The average ACC value from FP was not as high as those from Bioalerts and SARpy. Bioalerts had the highest average ACC value, 0.941, and its IG value was also high. The average PR value of the substructures detected by Bioalerts was not high because most of its substructures distributed PR between 0 and 0.05. However, if considering the absolute number of substructures with high PR values (>0.1), Bioalerts was still outstanding. As for SARpy, though it did not remove those substructures with low ACC values (