
Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach

Nadine Schneider,*,† Nikolas Fechner,† Gregory A. Landrum,‡ and Nikolaus Stiefl†

†Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
‡KNIME.com AG, Technoparkstr 1, 8005 Zurich, Switzerland

Received: May 4, 2017. Published: July 17, 2017. DOI: 10.1021/acs.jcim.7b00249




ABSTRACT: Big data is one of the key transformative factors that increasingly influences all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is no different: more and more data are being generated, for instance, by technologies such as DNA-encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling these huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and the relationships between those topics to be investigated. In this first study, we thoroughly evaluate the novel method in different experiments and discuss both its advantages and disadvantages. We show very promising results in reproducing human-assigned concepts by using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm, a huge benefit compared to other unsupervised machine-learning methods, like clustering, that are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL 22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like "proteins", "DNA", or "steroids". Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo), which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.



INTRODUCTION

The amount of data we produce every day is far beyond human processing capacity. Additionally, most of these data are unstructured and unlabeled, making it very difficult to gain insights from them. Topic models,1,2 primarily developed in the text-mining field, can help to uncover the hidden structure of a data set and allow us to detect connections within the data that might otherwise be overlooked. First applied mainly to texts, topic modeling is today successfully used in a wide range of disciplines, from text mining and the social sciences to computer vision and biology.3−7

Topic modeling is a probabilistic framework originally developed to extract the hidden thematic structure of a collection of text documents. For example, models have been built on large text corpora like 1.8 million articles from The New York Times or 300 000 publications from Nature.8 There, the extracted topics can be summarized, for example, as "Music", "Literature", or "Art" for The New York Times and as "Genetics", "Neural science", or "Astronomy" for Nature. The extracted thematic structure is quite compatible with the concepts humans would use to organize the texts; this is a major advantage over other methods, like clustering, whose results can be difficult to interpret. The simplest description of topic models is that they detect words that tend to co-occur across a large number of documents; these words then represent a topic. So, the result of a topic model is not a list of "named" topics but a set of words for each topic. The model assigns probabilities to the words, but the meaning of the topics is assigned by humans. Furthermore, a basic idea in topic modeling is that documents do not exhibit only one topic but comprise a mixture of topics. This mixture, or distribution, of topics is also provided by the model for each document, so that we can search for similar documents based on their thematic compositions. This way of modeling data allows us to explore data sets in a novel, very intuitive way and has been successfully employed in many areas and applied to data other than text documents. For example, in systems biology and bioinformatics it is used for exploring microarray data and medical data sets or for interpreting gene sets.7,9,10 Zhao and co-workers found that topic models could retrieve the correct grouping of bacteria into serotypes (∼41 000 samples and 20 different serotypes) and achieve more accurate results than conventional clustering methods in dividing cancer data into subtypes (111 samples and 2 subtypes).7 Other examples of the use of topic models in bioinformatics were recently reviewed by Liu et al.11 These and other studies show that topic modeling can be transferred to data types other than text and that it is a promising alternative to traditional data analysis methods like clustering.

Due to the massive number of chemical structures found in databases like ZINC,12 PubChem,13 or ChEMBL14 and due to novel technologies, like DNA-encoded libraries15 that generate billions of novel molecules, alternative approaches to organizing and exploring these data are highly sought after. Clustering approaches like K-means are often applied to group compounds into smaller sets. These algorithms usually rely on a similarity measure to divide the molecules into clusters. Defining a similarity between compounds is often not straightforward, and, in addition, the resulting clusters are often hard to interpret. While topic modeling does not require a similarity measure and yields interpretable results, we do need an algorithm that transforms the molecules such that they are compatible with the approach. One such transformation of molecular data to make it suitable for use in a text-mining algorithm was proposed by Hull and colleagues,16 who developed Latent Semantic Structure Indexing (LaSSI), an extension of Latent Semantic Indexing (LSI)17 to molecular data. In contrast to topic modeling, LSI is based on a singular value decomposition of the input matrix and is not a probabilistic model; the resulting model is less flexible than a topic model and leads to a linear subspace that captures most of the variance in the data. In LaSSI a molecule−descriptor matrix, equivalent to the document−word matrix for texts, is created from the compound set; molecular fingerprints with counts were used as the descriptors in that study. LaSSI has been successfully applied in virtual screening and was demonstrated to be superior to common similarity searches18,19 in some cases.

In this publication, we present the first chemistry-related implementation of topic modeling for organizing large molecule sets into "chemical topics". A chemical topic can be seen as a pattern of co-occurring fragments that recurs across a set of molecules. The focus of this study is on evaluating the performance, interpretability, and robustness of the new method. As with all unsupervised learning methods, it is difficult to quantitatively measure the performance of a topic model per se. To mitigate this limitation, we used labeled data in our experiments and quantified how well the model could reconstruct series of chemical compounds from a set of molecules. This experiment also allows us to assess the interpretability of a chemical topic model: does the model recover human concepts (the chemical series) from the data? Similar evaluation methods are used in the field of text mining, where experiments like word intrusion into a topic or topic intrusion into a document are conducted to quantitatively evaluate how well the inferred topics match human concepts.20 The evaluation of the quality of topic models is an active field of research in which additional aspects, like the applicability to novel data21,22 and the stability of a topic model,23 have been investigated. Since one of the major advantages of topic modeling is the interpretability of the models, we were also interested in the robustness and stability of a chemical topic model, in order to assess how useful the approach could be for gaining reliable and reproducible insights.

The paper is organized as follows: in the first section we present our data sets, more background on the topic modeling approach, and a detailed description of our implementation. Next, we thoroughly evaluate several aspects of the new method in a series of experiments. Finally, we discuss the results and give an outlook on future perspectives and applications of this novel approach.



METHODS AND MATERIALS

In this section we first introduce the data sets we used and constructed for this study. In the second part the implementation of the chemical topic model is described. Finally, the evaluation methods used in our study are presented.

Data Set. In our study we used data set II published by Riniker et al. in their fingerprint-based virtual screening study.24,25 This data set was extracted from the ChEMBL database,14,26 which mainly contains compounds extracted from the scientific literature. A typical medicinal-chemistry paper usually includes data on one or two chemical series and sometimes reference compounds which might be structurally different. These chemical series are the "chemical topics" that we try to retrieve. Riniker and co-workers constructed the data set as follows: from publications on 50 protein targets which had been identified as being difficult for virtual screening, they selected only those that contain at least ten active compounds. In addition, targets with fewer than four publications were discarded from their data set. Their final data set consists of 37 targets with 4−37 papers and 10−112 active compounds per paper. Here we further constrained this data set to publications with at least 20 active compounds to obtain larger and more meaningful chemical series. This results in 36 targets with 1−26 papers each. We constructed two data sets out of these papers: data set A consists of 36 different papers, one randomly chosen for each target. Six compounds were identified as being nonunique for a single target: they appeared multiple times for different isoforms of carbonic anhydrase. These ambiguous compounds were omitted from our data set, resulting in 880 different compounds (see Table S1 for detailed information). A second data set (data set B) was constructed by selecting five pharmaceutically relevant targets from the 36 targets of data set A which cover different target classes. For those we extracted all publications from the ChEMBL 22 database. All compounds with a measured Ki, IC50, EC50, or AC50 were included. Molecules which occur in more than one publication and compounds which have more than one activity value assigned were removed from the data set to avoid ambiguities. Papers with fewer than 15 different compounds were also excluded from the final data set. Table 1 summarizes the final data set.

Table 1. Summary of Data Set B^a

    target                     ChEMBL target ID   no. papers   no. molecules
    carbonic anhydrase II      15                 38           887
    dopamine D2 receptor       72                 47           1856
    MAP kinase p38 alpha       10188              31           959
    dipeptidyl peptidase IV    11140              43           1330
    cathepsin S                11534              27           855

^a Both data sets (A and B) are given as csv files in the Supporting Information.

Finally, we employed the whole ChEMBL 22 data set as a large-scale data set to test the scalability of our new method. This data set consists of about 1.6 million unique compounds that were found for about 11 000 different targets and extracted from more than 65 000 papers. The data set was used as is; we did not preprocess the molecules, since we only wanted to assess the efficiency, robustness, and runtime of the novel method.

Topic Modeling. One of the most widely applied and simplest algorithms for topic modeling is Latent Dirichlet Allocation (LDA), developed by Blei and co-workers.2 This algorithm is also the basis of our chemical topic model, and we briefly summarize it here; more background and algorithmic details can be found in the original publications.2,3,8,21 LDA is a Bayesian probabilistic model. It describes a generative process in which documents arise from an imaginary random process: each document is produced from a set of topics, and these topics are distributions over a fixed vocabulary. So, in a first step, a distribution over topics is randomly chosen for a document. Then, from those topics, words are selected by sampling from the respective word distributions. An important part of the concept is that LDA is a mixed-membership model: documents exhibit different topics in different proportions. In reality, only the documents, i.e., the distributions over words, are observed; the thematic structure of the documents, as implied by the topic distribution, is hidden. LDA provides a tool to infer this hidden thematic structure by estimating a likely posterior distribution of the hidden variables. The hidden variables, or the hidden topical structure, can be described more formally: for a corpus of multiple documents D, K different topics are assumed. Each topic k represents a multinomial distribution β_k over a fixed vocabulary and is drawn from a Dirichlet distribution. Each document d exhibits a topic distribution θ_d, which is also drawn from a Dirichlet. For each word n of the vocabulary, a topic assignment z_{d,n} exists in a document d. Finally, given the observed variable, the distribution of words w_d for each document d, the posterior probability can be calculated as

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}$$

(see ref 3). Here, the numerator is the joint probability of the random variables, and the denominator is the marginal probability of the observed variables, i.e., the probability of obtaining this corpus under any topic model. Since the latter is intractable to compute exactly, different methods have been developed to approximate it efficiently (see, for example, refs 2 and 3). We use the scikit-learn27 LDA implementation, which is based on an online variational Bayes algorithm.8,21 We adapted the following parameters for our chemical topic models: the number of topics, the learning method (we use "batch" as default), the maximum number of iterations for the optimization to converge (increased to 100), and the random state (to obtain a reproducible model). The resulting model contains the distribution of topics over documents and the distribution of words over topics. We normalize both matrices by their row sums so that we obtain a probability distribution over topics for each document and a probability distribution over words for each topic. This allows investigating the most probable words of a topic, interpreting the meaning of a topic, and finally assigning names or labels to the topics.
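To make this step concrete, the following is a minimal sketch of how the LDA stage can be run with scikit-learn. The count matrix, its dimensions, and the parameter values are illustrative placeholders, not the exact CheTo code.

```python
# Minimal sketch of the LDA step with scikit-learn (illustrative, not CheTo).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(880, 554))   # placeholder molecule-fragment count matrix

lda = LatentDirichletAllocation(
    n_components=60,            # number of topics (called n_topics in older releases)
    learning_method="batch",    # default training method used in this study
    max_iter=100,               # maximum number of optimization iterations
    random_state=42,            # fixed seed for a reproducible model
)
doc_topic = lda.fit_transform(X)   # molecule-topic matrix
topic_frag = lda.components_       # topic-fragment matrix

# Row-normalize both matrices to obtain probability distributions.
doc_topic = doc_topic / doc_topic.sum(axis=1, keepdims=True)
topic_frag = topic_frag / topic_frag.sum(axis=1, keepdims=True)
```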

In the following we describe how we adapted the topic modeling approach to chemical data, thereby enabling the organization of collections of chemical compounds into chemical topics.

Figure 1. Chemical topic modeling workflow. The terms used in the context of topic modeling of text documents are shown in gray to make the connection to chemical topic modeling.

Chemical Topic Modeling. In our chemical topic modeling approach we define molecules as documents and the substructures or fragments derived from these molecules as words (see Figure 1). These fragments can be generated using different fragmentation approaches, such as chemical fingerprints, rule-based bond-cutting methods like BRICS,28 or predefined substructures. A detailed description of the fragment generation can be found below. After generating the fragments, a matrix is constructed in which the rows represent the molecules and the columns contain the counts of each fragment occurring in the respective molecule. Finally, this matrix can be filtered by excluding the most common and the rare fragments. Common fragments are defined as fragments occurring in more than 10% of the molecules; rare fragments are those found in less than 1% of the compounds (0.1% for the larger data sets). We show results using both unfiltered and filtered molecule−fragment matrices. This matrix, the corpus of our molecule data set, is used as input for the LDA algorithm. The final LDA model returns two matrices: the topic−fragment matrix and the molecule−topic matrix. Based on these matrices, the topics along with the most probable fragments per topic can be retrieved and visualized. For the molecules, a topic profile, showing the probabilities of a molecule being associated with each topic, can be extracted, or, alternatively, the most likely topic can be assigned to each molecule. In the following, the fragment generation, the fragment filtering, and the visualization of topics are explained in more detail.

Fragment Generation. We used three different methods to fragment the molecules for the LDA: circular Morgan fingerprints,29 path-based RDKit fingerprints,30 and BRICS fragments.28 This allows investigating the influence of fragment size and shape (smaller, overlapping fragments from the fingerprints vs larger fragments from the BRICS approach). For the Morgan fingerprints, we chose a radius of two bonds and excluded all fingerprint bits associated with a radius of less than two (compare Figure 2, left). Additionally, we adapted the standard Morgan atom invariant to make it less specific: our invariant does not consider formal charges and isotopes but does include information about the aromaticity of an atom. In the case of the RDKit fingerprint, the minimum path length (number of bonds) was set to three bonds and the maximum length to five (Figure 2, right). Here the usual invariant is used, which only considers the atomic number, degree, and aromaticity of the atoms. Using the BRICS rules, which describe how to break molecules into retrosynthetically interesting chemical substructures,28 we obtained a different set of fragments. These are in general larger and, more importantly, not overlapping like the fragments generated with the fingerprint methods. Figure 3 shows fragments obtained with the three different methods to highlight how different the fragments of the approaches are. To avoid bit collisions caused by hashing, the size of the fingerprints was not restricted. The fingerprints and BRICS fragments were generated as count vectors using the open-source cheminformatics toolkit RDKit (version 2016.09.2).30

Figure 2. Fingerprint-based fragments. (left) Generation of Morgan FP fragments with radius two. (right) Generation of RDKit FP fragments with a path length between three and five bonds; linear and branched paths are possible. (gray circles) Atoms in aliphatic rings. (yellow circles) Atoms in aromatic rings. (gray lines) Bonds to atoms which are not part of the fragment; in the Morgan FP those atoms are implicitly included in the invariant, in the RDKit FP they are ignored. (dotted lines) Aromatic bonds.

Figure 3. Exemplary fragments derived from random molecules using the three different fragment approaches. (gray circles) Aliphatic ring atoms. (yellow circles) Aromatic ring atoms. (gray lines) Neighboring atoms not directly considered for the fragment. (dotted lines) Aromatic bonds.
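As an illustration, a radius-2 Morgan fragmentation along the lines described above can be sketched with the RDKit as follows; the custom atom invariants used in our implementation are omitted, and the molecule is an arbitrary example.

```python
# Sketch: radius-2 Morgan fragments as counts with the RDKit.
# The adapted atom invariants described in the text are omitted for brevity.
from collections import Counter
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("Cc1ccc(Cl)cc1NC(=O)N1CCCCC1")  # arbitrary example molecule

bit_info = {}
AllChem.GetMorganFingerprint(mol, 2, bitInfo=bit_info)

# Keep only environments with radius exactly two and count their occurrences.
frag_counts = Counter()
for bit_id, environments in bit_info.items():
    n = sum(1 for atom_idx, radius in environments if radius == 2)
    if n:
        frag_counts[bit_id] = n
print(frag_counts)
```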

Fragment Filtering. The final fragments of each molecule form the vocabulary of our corpus. Here we filter out the common and the rare fragments: all fragments occurring in more than 10% of the molecules in the data set are ignored, as are fragments appearing in less than 0.1% (data sets with >1000 molecules) or 1% (smaller data sets) of the molecules. These parameters can be adjusted according to the data set or the objective of the experiment. This filtering idea was also adopted from the text-mining field, where it is a usual preprocessing step when building a topic model: texts are filtered to remove "stop words" (terms like "and", "not", etc.) and very rare words.8 The removal of rare words is done for two reasons: for efficiency, since each element in the vocabulary adds another column to our input and output matrices, and to prevent rare terms from disproportionately influencing the results.31 All of the important parameters of our chemical topic model are summarized in Table 2.

Table 2. Important Parameters for the Chemical Topic Model

    parameter                    value range          default value
    number of topics             1−1000               -
    threshold rare fragments     0.0−1.0              0.001
    threshold common fragments   0.0−1.0              0.1
    sample data set size         0.1−1.0              0.1 for big data sets, 1.0 for small data sets
    fragment type                Morgan, RDK, BRICS   Morgan
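In code, this filtering amounts to a simple document-frequency filter over the per-molecule fragment counts. The sketch below assumes the per-molecule Counter objects from the fragment-generation example; thresholds and names are illustrative.

```python
# Sketch: build a filtered molecule-fragment count matrix (illustrative).
from collections import Counter
import numpy as np

def build_filtered_matrix(frag_counts_per_mol, min_df=0.01, max_df=0.10):
    """frag_counts_per_mol: one Counter of fragment counts per molecule."""
    n_mols = len(frag_counts_per_mol)
    doc_freq = Counter()
    for counts in frag_counts_per_mol:
        doc_freq.update(counts.keys())
    # keep fragments that are neither too rare nor too common
    vocab = sorted(f for f, n in doc_freq.items()
                   if min_df * n_mols <= n <= max_df * n_mols)
    col = {f: i for i, f in enumerate(vocab)}
    X = np.zeros((n_mols, len(vocab)), dtype=np.uint8)  # 8-bit counts, as in the text
    for row, counts in enumerate(frag_counts_per_mol):
        for f, n in counts.items():
            if f in col:
                X[row, col[f]] = min(n, 255)
    return vocab, X
```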

One important difference in chemical topic modeling is that the final molecule−fragment matrix, the input matrix for the LDA algorithm, might be sparser than is usual for text documents. This leads to new challenges, such as molecules without features (due to filtering) or the need to include a larger number of topics to obtain a useful model. We discuss these challenges below in the Results section.

Visualization. As described above, the topic modeling algorithm generates a topic−fragment probability matrix (see Figure 1) containing the probability of each fragment being associated with a certain topic. This matrix allows an intuitive visualization of the chemical topic model: topics can be directly highlighted within compound structures. The most probable fragments of a topic can help to quickly identify topics that are structurally interesting. This is a distinct advantage of the topic modeling approach compared to clustering methods like K-means, where interpretation is difficult because the clusters are defined solely by the compounds that compose them and their similarities. Figure 4 shows an example visualization of a chemical topic model. At the top of Figure 4, a histogram indicates the fraction of molecules associated with each topic. This helps in assessing which of the topics are more general (a higher proportion of molecules usually falls into those) and which are more specific. Furthermore, the molecules assigned with high probability to a certain topic can be analyzed, and the topic, i.e., the fragments most likely associated with it, can be highlighted within the structures of the compounds. In the example in Figure 4, the dichlorophenyl ring attached to the 2-quinolinone and the N-tert-butyl-substituted piperidine ring constitute the major motifs of topic 21. This can also be inferred from the top three fragments of this topic (see Figure 4, top left). The visualization exemplified in Figure 4 is part of our chemical topic model implementation, which can be found on GitHub (www.github.com/rdkit/CheTo).

Figure 4. Topic model visualization. (top) Example visualization of a 60-topic model of data set A. For each molecule the most probable topic is determined and the molecule is assigned to this topic. The distribution of these assignments is depicted in the histogram, indicating the fraction of molecules associated with each topic. (top left) Top three fragments of topic 21 along with their probability of being associated with this topic. The Morgan FP fragmentation was used to build this model. (bottom) Top four molecules of topic 21 based on their topic probability; within the structures, fragments with a high probability for topic 21 are highlighted in light blue (the larger the highlight radius, the higher the probability).
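Retrieving the most probable fragments of a topic, which drives both the fragment depictions and the structure highlighting in Figure 4, is just a row-wise sort of the normalized topic−fragment matrix. A small self-contained sketch with placeholder data:

```python
# Sketch: top fragments per topic and hard topic assignment (placeholder data).
import numpy as np

rng = np.random.RandomState(0)
topic_frag = rng.rand(60, 554)
topic_frag /= topic_frag.sum(axis=1, keepdims=True)   # fragment probabilities per topic
doc_topic = rng.rand(880, 60)
doc_topic /= doc_topic.sum(axis=1, keepdims=True)     # topic probabilities per molecule
vocab = ["frag_%d" % i for i in range(554)]

def top_fragments(topic_frag, vocab, topic_id, n=3):
    # indices of the n most probable fragments of this topic
    order = np.argsort(topic_frag[topic_id])[::-1][:n]
    return [(vocab[i], float(topic_frag[topic_id, i])) for i in order]

print(top_fragments(topic_frag, vocab, topic_id=21))
most_likely_topic = np.argmax(doc_topic, axis=1)      # one topic label per molecule
```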

Model Evaluation. In order to assess whether topic modeling is applicable to and useful for chemical data, we designed several experiments. In the first experiment, a kind of proof of concept, we evaluated whether the method is able to extract "reasonable" topics from a set of compounds. For this we used a data set containing various chemical series which were developed for 36 different protein targets (data set A, see above and Table S1). For each of the 36 targets we determined the major topic, the topic assigned to the majority of the compounds of a given series, and calculated the recall and the precision in this major topic. The recall measures how many of the compounds of a series are retrieved in the major topic of a target; the precision reflects how many compounds from other series also land in the major topic of a given target. This experiment enables us to quantitatively evaluate how well the inferred topics match the concepts of medicinal chemists. It is similar to the word/topic intrusion experiments proposed by Blei and co-workers,20 and is in a way their reverse: the chemical series is a human-assigned concept, and we check how well the topic model recapitulates it and whether the model channels "intruder compounds" into a series. Additionally, we use this setup to explore and optimize parameters important for the topic model: the number of topics and the fragment types/filtering (compare Table 2).

Furthermore, we evaluated the topic stability by testing whether or not the most probable fragments of the major topics of the different targets stay the same across ten different LDA runs, i.e., when changing the random number seed of the LDA in each run. This is an important point if we use the model to interpret and learn from our data. In order to quantify stability, we calculated the mean pairwise Tanimoto similarity of the fragment probability vectors resulting from ten different runs for the major topic of each target. We used the following version of the Tanimoto similarity for continuous values, proposed by Sayle and co-workers:32

$$\mathrm{TS}(x, y) = \frac{\sum_{i=1}^{N} \min(x_i, y_i)}{\sum_{i=1}^{N} \max(x_i, y_i)}$$

The final experiment was to measure the runtime efficiency of the method in a large-scale setting: generating a topic model for the 1.6 million molecules of the whole ChEMBL 22 data set.
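This similarity translates directly into code; a minimal version, assuming x and y are fragment probability vectors over the same vocabulary:

```python
import numpy as np

def tanimoto_continuous(x, y):
    """Continuous-valued Tanimoto similarity (Sayle et al. variant)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

print(tanimoto_continuous([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))  # 0.818...
```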

RESULTS

In the following we first investigate the behavior of our novel approach by assessing its ability to identify compound series in a set of molecules. We determine the number of topics necessary to obtain a good separation of the molecules. Furthermore, we investigate the influence of the different fragmentation methods and filters on the results. The stability of the topics, i.e., the composition of a topic in terms of its most probable fragments, is also analyzed and discussed. Finally, an application of the method to organizing larger sets of molecules is presented, complemented by runtime measurements of the method on the ChEMBL 22 data set.

Different Fragmentation Methods and Number of Topics. To create a chemical topic model, the compounds of the data set first need to be fragmented. Here we tried three different methods: Morgan and RDKit fingerprints as well as the BRICS bond-cutting rules. These methods generate different types of fragments, as described above (compare also Figure 3). Furthermore, we analyzed the influence of filtering those fragments to remove common and rare fragments. Table 3 summarizes the results of fragmenting the molecules of data set A.

Table 3. Results of Different Fragmentation Methods and Filtering of Data Set A

                unfiltered                          filtered (thresholds: 0.1 common, 0.01 rare)
    method      no. of     mean no. of different    no. of     mean no. of different    no. of molecules
                fragments  fragments per molecule   fragments  fragments per molecule   w/o fragments^a
    Morgan FP   2839       21.4                     554        14.7                     3 (0.34%)
    RDK FP      5620       170                      2276       79                       1 (0.11%)
    BRICS       553        5.2                      85         2.4                      44 (5%)

^a In parentheses: percent of the data set (880 molecules).

From Table 3 it is clear that filtering out common and rare fragments dramatically reduces the vocabulary of our topic model. The major effect comes from filtering out rare fragments, which leads to a reduction of 53−83% in the number of fragments. A disadvantage of the filtering is that we lose a few molecules which consist solely of such rare fragments (compare Table 3, last column).

Next, after generating the molecule−fragment occurrence matrices, a topic model can be built by choosing the desired number of topics. In our first experiment, the retrieval of the 36 compound series of data set A, the expected number of topics is almost predetermined. However, it needs to be considered that a medicinal-chemistry publication might contain more than one compound series or reference compounds which are not members of the series. Furthermore, some targets in data set A, such as the carbonic anhydrases, are related to each other; this may mean that some of the targets/publications contain compounds from very similar series. For these reasons, and because we wanted to explore the chemical topic model more deeply, we ran the model with different parameter settings: 10 to 100 topics, the three different fragment methods, and ten different seeds for the LDA algorithm. Figure 5 shows the result of this experiment: the mean overall recall and precision of ten different LDA runs.

Figure 5. Selection of the number of topics for data set A. Mean recall and precision over ten runs were chosen as criteria. The median recall/precision of each run was derived from the recall/precision of the major topic (the topic to which the largest fraction of molecules of a series was assigned) for the 36 different compound series. The results were averaged over ten different runs using different seeds for the LDA algorithm. The circles show the mean value; the shaded area highlights the standard deviation. (right) Results using an unfiltered molecule−fragment matrix. (left) Results using a filtered molecule−fragment matrix (rare fragments = 0.01, common fragments = 0.1).

In general, the best-performing topic models in terms of overall recall and precision for the 36 compound series are obtained by choosing between 50 and 60 topics. The FP-based fragmentation methods, Morgan and RDKit FP, perform better than the BRICS fragments. In addition, the performance of the model did not change significantly when using the unfiltered matrices for Morgan or RDKit fragments (compare Figure 5, left). This is different for the BRICS fragments, where a measurable increase in recall is found when using the unfiltered fragment matrix (see Figure 5, top). This is not surprising given the small number of BRICS fragments remaining after filtering (see Table 3), which implies that our filtering criteria are too strict for the BRICS fragments. In summary, all three fragment methods show promising results in this experiment, achieving a mean recall between 90 and 95% and a mean precision in the same range when an appropriate number of topics is chosen. Selecting exactly 36 topics to retrieve the 36 compound series would still have resulted in an excellent recall but only a medium precision (see Figure 5). This is because the LDA method optimizes the consolidation of word co-occurrences across documents into topics rather than the separation of documents. Due to our evaluation criteria, choosing fewer than 36 topics necessarily leads to a weak precision.

Detailed Analysis of a Chemical Topic Model. Next, we investigated the results in more detail. Based on the first experiment we selected a topic model with 60 topics and a filtered Morgan FP fragment matrix as input (see Figure 5, right middle). Again, ten LDA runs with different seeds were carried out, since the results of the topic model may vary with the initialization of the document−topic and topic−fragment matrices. Figure 6 shows box plots of the results of this experiment: for each of the 36 compound series, the number of topics as well as the recall and the precision in the major topic of the series are plotted. More detailed results can be found in Table S2 in the Supporting Information.

Figure 6. Topic model of data set A using 60 topics and the Morgan FP to fragment the molecules. The y-axis shows the 36 targets of data set A. For each molecule in each of the 36 chemical series, the assignment to its most probable topic was used to obtain the number of different topics and to calculate the recall and the precision in the major topic (the topic to which the largest fraction of molecules of the series was assigned). Fragments were filtered (see text). The box plots show the variation over ten different runs using different seeds for the LDA algorithm. (red line) Median. (box) Lower and upper quartiles (Q1, Q3). (whiskers) Most extreme nonoutlier data points (Q1 − 1.5 × (Q3 − Q1) and Q3 + 1.5 × (Q3 − Q1)). (plusses) Outliers.

One third of the compound series obtain an excellent mean recall and precision of over 90% (see, for example, "Sphingosine 1-phosphate receptor Edg-1", "Beta-2 adrenergic receptor", or "MAP kinase p38 alpha" in Figure 6). The median mean-recall of 92% and the median mean-precision of 94% show that the chemical topic model retrieves most of the chemical series. The worst performance was observed for the "HERG" series, which could be expected for this well-known antitarget: no chemical series is described in the original publication33 (see the very low pairwise Tanimoto similarity in Table S2); it is simply a collection of HERG-active molecules, so the topic model could not identify a common structural motif and distributed the molecules over many different topics (up to 13). Another set of chemical series showing only medium performance belongs to the carbonic anhydrases. These series all share a sulfonamide group attached to an aromatic ring, the typical zinc-binder motif of these inhibitors, but nothing more specific. The "carbonic anhydrase IX" series is especially problematic: most of the molecules in this series are rather small and consist primarily of that motif. Due to the similarity between the carbonic anhydrase series, many of them share a major topic in the chemical topic model (see Figure 7). This leads to a worse precision and can also result in a smaller recall at the same time (compare Figure 6). Sharing a major topic can also happen for other series with common substructures. One example is the "Muscarinic acetylcholine receptor 1" series and the "Cytochrome P450 19A1" inhibitors: most of the latter contain a terminal biphenyl moiety, which can also be found in about one-third of the compounds of the muscarinic acetylcholine receptor 1 series. In one of our models (see Figure 7, run 5) they share their major topic, which leads to medium recall and precision for both series. Another example is the "Melanin-concentrating hormone receptor 1" series and the "Serotonin 2a (5-HT2a) receptor" series, which share a topic in the last run (see Figure 7). Here, the combination of two less common fragments, a tertiary amine in an aliphatic ring and a trifluoromethylphenyl, which appears in most of the melanin-concentrating hormone receptor 1 compounds but in only one of the serotonin 2a (5-HT2a) receptor molecules, ties these series together. The result in this case is a perfect recall for both series but a medium precision due to the shared major topic. In summary, an analysis of the chemical series sharing their major topic allows the identification of related series which contain similar substructures. This could be especially interesting if the shared substructure is involved in or critical for binding to the target.
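The per-series recall and precision used throughout this analysis can be sketched as follows, assuming one series label per molecule and the hard topic assignments from above; all names are illustrative.

```python
# Sketch: recall/precision of the major topic for each series (illustrative).
from collections import Counter
import numpy as np

def major_topic_metrics(series_labels, most_likely_topic):
    series_labels = np.asarray(series_labels)
    most_likely_topic = np.asarray(most_likely_topic)
    results = {}
    for series in np.unique(series_labels):
        in_series = series_labels == series
        # major topic = topic assigned to most molecules of this series
        major = Counter(most_likely_topic[in_series].tolist()).most_common(1)[0][0]
        in_major = most_likely_topic == major
        recall = (in_series & in_major).sum() / in_series.sum()
        precision = (in_series & in_major).sum() / in_major.sum()
        results[series] = (major, recall, precision)
    return results
```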

Figure 7. Compound series sharing major topics. Compound series of data set A which share their major topic within a topic model (one column = one run, Run 1−10) are highlighted in the same color. Each column shows the result of an independent run. Compound series colored in white can be found in their own major topic. Number of topics chosen: 60. Fragmentation method: Morgan FP. Input: filtered molecule−fragment matrix.

Finally, 2 of the 36 series, Beta secretase 1 and Sphingosine 1-phosphate receptor Edg-1, are completely stable in all 10 runs and achieve a perfect mean recall and precision of 1.0. These two series are both very homogeneous and contain a large, specific scaffold, which makes them special topics. Figure 8 shows these two topics along with the top five fragments of each. While in the beta secretase topic all of these fragments have an equal probability, in the sphingosine topic the first fragment, a terminal phenyl ring attached to a quaternary carbon in an aliphatic ring (Figure 8, top left), is twice as likely as the other four fragments. Investigating the top six compounds of each topic, it becomes clear that these topics are highly specialized for the two chemical series: almost the whole scaffolds of the molecules are described by the topics.

Figure 8. Beta secretase and Sphingosine 1-phosphate receptor Edg-1 topics of the chemical topic model of data set A. Chemical topic model: 60 topics, filtered Morgan fragment matrix, seed 57. (top) Five most probable fragments of both topics along with their scores/probabilities; the latter vary slightly between the different runs. (bottom) Top six molecules of both topics. All of these have a probability of more than 90% for their topics. The topic is directly highlighted in turquoise/light orange within the compound structures.

Topic Stability. We also investigated the stability of the novel approach in terms of the most probable fragments that compose the topics. This is an important aspect when the chemical topic model is used for interpretation and analysis of data, as shown in Figure 8. For this we calculated the Tanimoto similarity of the fragment probability vectors for each topic between different runs. We used the major topic of each series to allow a correct mapping of topics across runs. When calculating the mean pairwise Tanimoto similarity, we only considered the most probable fragments of a topic, i.e., those that account for 80% of the probability in the topic−fragment probability vector. Table 4 summarizes the results of this experiment using the Morgan FP fragmentation (results for the other fragment methods can be found in Table S3).



Table 4. Topic Stability Across 10 Different LDA Runs^a

    target                                          mean similarity   std dev
    carbonic anhydrase II                           0.13              0.13
    HERG                                            0.20              0.23
    carbonic anhydrase I                            0.36              0.24
    carbonic anhydrase XII                          0.47              0.29
    muscarinic acetylcholine receptor M1            0.47              0.21
    cytochrome P450 19A1                            0.51              0.26
    cathepsin S                                     0.52              0.21
    norepinephrine transporter                      0.52              0.29
    dopamine D3 receptor                            0.56              0.16
    11-beta-hydroxysteroid dehydrogenase 1          0.58              0.27
    carbonic anhydrase IX                           0.58              0.20
    serotonin 2a (5-HT2a) receptor                  0.58              0.37
    cannabinoid CB2 receptor                        0.60              0.23
    tyrosine-protein kinase SRC                     0.60              0.21
    alpha-2a adrenergic receptor                    0.67              0.18
    serotonin 2c (5-HT2c) receptor                  0.67              0.19
    C−C chemokine receptor type 2                   0.71              0.14
    adenosine A1 receptor                           0.72              0.09
    cytochrome P450 2D6                             0.72              0.16
    vanilloid receptor                              0.72              0.12
    dipeptidyl peptidase IV                         0.73              0.16
    melanin-concentrating hormone receptor 1        0.73              0.16
    cyclooxygenase-2                                0.75              0.21
    histamine H3 receptor                           0.75              0.24
    vascular endothelial growth factor receptor 2   0.79              0.09
    dopamine D2 receptor                            0.80              0.10
    cytochrome P450 3A4                             0.81              0.11
    MAP kinase p38 alpha                            0.81              0.16
    serotonin 1a (5-HT1a) receptor                  0.83              0.14
    beta-2 adrenergic receptor                      0.84              0.06
    cannabinoid CB1 receptor                        0.84              0.09
    glucocorticoid receptor                         0.84              0.11
    matrix metalloproteinase-2                      0.85              0.07
    serotonin transporter                           0.85              0.12
    beta-secretase 1                                0.88              0.05
    sphingosine 1-phosphate receptor Edg-1          0.91              0.05
    median                                          0.72              0.16

^a Chemical topic model: Morgan fragmentation of data set A, filtered; 60 topics. The mean pairwise Tanimoto similarity and the standard deviation of the most probable fragments of each topic are listed.

As seen before, the carbonic anhydrase series and the HERG compound series show the lowest mean similarity of the fragment probability vectors over the ten different runs. In general, the mean similarity is rather high, lying above 0.6 for most of the series. To get a better idea of what these values mean, Figure 9 shows the top 5 fragments and the most probable molecule of the carbonic anhydrase XII topic for 10 different LDA runs. The carbonic anhydrase XII topic was selected because its mean similarity was below 0.5. Nevertheless, the chemotype does seem to be fairly consistent, judging from the top five fragments of this topic and the topic itself as highlighted within the structure of the most probable compound. Although the order of the top five fragments changes between runs, and in some runs (1, 9, 10) other fragments appear in the top five, the major substructures representing the carbonic anhydrase XII topic stay the same: the urea group between the two phenyl rings and the sulfonamide moiety. In runs 1, 9, and 10 the most probable compound does not belong to the carbonic anhydrase XII series but to the cytochrome P450 3A4 series, which also exhibits this motif of a central urea between two phenyl rings. This experiment shows that even if the composition of a topic in terms of its most probable fragments changes between runs, the overall meaning of the topic stays rather constant, so the model can be useful for analyzing chemical data sets.

Figure 9. Top 5 fragments and most probable molecule of the carbonic anhydrase XII topic across 10 different LDA runs. Chemical topic model: Morgan FP fragmentation, filtered fragment matrix, 60 topics.

Chemical Topic Modeling for Related Compounds. In order to explore chemical topic modeling in a slightly different setup, we created a second data set consisting of five subsets of chemical series for five different protein targets (carbonic anhydrase II, dopamine D2 receptor, MAP kinase p38 alpha, dipeptidyl peptidase IV, and cathepsin S). For each of the five targets we collected between 27 and 47 chemical series (see Table 1 for a summary). These series are more closely related to each other than the series in data set A. Consequently, this setup is expected to be more challenging for the topic model when re-extracting the human-assigned concepts (the different chemical series). As in the previous experiment, we ran the LDA algorithm with different numbers of topics to find the best compromise between median recall and precision for the five subsets. We chose the Morgan FP fragmentation to build the molecule−fragment input matrix and filtered the fragments using the same thresholds as before (0.1 for common fragments, 0.01 for rare fragments). Table 5 summarizes the results of this experiment; further results can be found in Figures S1 and S2.

Table 5. Final Number of Topics Selected for the Subsets in Data Set B^a

    LDA Batch Learning
    target                     no. papers   no. topics selected   median recall   median precision
    carbonic anhydrase II      38           60                    0.64            0.65
    dopamine D2 receptor       47           80                    0.87            0.86
    MAP kinase p38 alpha       31           40                    0.88            0.93
    dipeptidyl peptidase IV    43           70                    0.86            0.91
    cathepsin S                27           40                    0.92            0.91

    LDA Online Learning
    target                     no. papers   no. topics selected   median recall   median precision
    carbonic anhydrase II      38           80                    0.73            0.72
    dopamine D2 receptor       47           90                    0.84            0.82
    MAP kinase p38 alpha       31           60                    1.0             1.0
    dipeptidyl peptidase IV    43           70                    0.86            0.91
    cathepsin S                27           50                    0.98            1.0

^a Number of topics chosen for the best balance between median recall and precision. The performance is shown for two different LDA learning methods: batch and online.

A topic model able to almost perfectly retrieve the chemical series could be generated for four of the five targets. The exception is carbonic anhydrase II, where the best topic model, in terms of the balance between median recall and precision, achieves only moderate performance. There are several reasons for this: some of the series share a larger scaffold because one is taken from a follow-up paper of the other (see, for example, the papers with ChEMBL document IDs 20507 and 58463),34,35 other series are very generic and include no special structural features (e.g., ChEMBL document ID 20718),36 and, finally, some papers do not contain a series at all but rather a loose set of molecules (e.g., ChEMBL document ID 30111).37 The latter situation leads to weak performance in both recall and precision. In contrast, series that are strongly related (like series from follow-up papers) will probably land in the same topic and thereby cause a low precision.

Additionally, we used this data set to compare the performance of the two training methods provided by the LDA implementation: batch and online learning. In batch learning the whole data set is provided at once to the variational Bayes optimization of the LDA algorithm, while in online learning the model is optimized incrementally by running the optimization on chunks of the data set. The online approach is necessary when building models on large data sets that cannot be kept in memory at once. Since we would like to be able to apply the chemical topic model to larger sets of molecules, like the whole ChEMBL database, it is important to determine whether there is a substantial difference in performance between the two learning methods. Table 5 shows the results of both methods: there is no major difference in accuracy. The online method seems to perform slightly better but requires more topics to reach the higher accuracy. A thorough comparison of the two learning methods is out of the scope of this publication, but the similar performance of both methods makes the next experiment feasible: building topic models on the 1.6 million molecules of the ChEMBL 22 data set.
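Online training as compared here can be sketched with scikit-learn's partial_fit, feeding the count matrix in chunks; the chunk size of 5000 follows the large-scale experiment below, while the matrix itself is a placeholder.

```python
# Sketch: incremental (online) LDA training on chunks of the count matrix.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(20000, 2582)).astype(np.uint8)  # placeholder counts

lda = LatentDirichletAllocation(n_components=100, random_state=42)
for start in range(0, X.shape[0], 5000):       # chunks of 5000 molecules
    lda.partial_fit(X[start:start + 5000])     # online variational Bayes update
doc_topic = lda.transform(X)                   # apply the fitted model to the data
```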

Chemical Topic Modeling on the Whole of ChEMBL 22. Topic modeling has been successfully used in many fields to organize huge collections of documents. To enable that, the document−word matrix needs to be thoroughly filtered, as discussed above. Furthermore, online learning is required to handle the huge amount of data when building a topic model.21 In these final experiments we describe an efficient strategy we developed to allow model building on large chemical data sets. For this purpose we used the whole ChEMBL 22 data set,14 which contains about 1.6 million unique compounds. We started by determining what fraction of the data set needs to be considered to build a representative final vocabulary for the chemical topic model. Since the molecule−fragment matrix is filtered at the end to remove infrequent fragments, considering every single molecule when building this matrix might not be necessary: drawing a smaller random sample of the data set should result in the same, or at least a very similar, set of fragments after filtering. Figure 10 shows the result of this analysis using the Morgan fragment method on the ChEMBL 22 data set. The experiment clearly illustrates that subsampling 10% of the data set generates an almost identical vocabulary: we found a coverage of about 95% of the fragments compared to building the vocabulary from all of the data. The same result was found for the two other fragment methods (data not shown). This strategy reduces the amount of memory required to store the fragments of each molecule as well as the size of the final molecule−fragment matrix. Building the model using only 10% of the data may also result in a more general model. Using this approach we fragmented the ChEMBL 22 data set with the following parameters: subsampling size 10%, threshold for rare fragments 0.1%, threshold for common fragments 10%, and the Morgan fragment method. With these settings the fragmentation of the whole ChEMBL 22 data set takes about 30 min on four CPU cores of an Intel Xeon 64-bit 3.6 GHz eight-core processor. The final vocabulary consisted of 2582 Morgan fragments, leading to a molecule−fragment matrix of size (1 599 759, 2582), or 3.85 GB (we use 8-bit integers for this count matrix).

Figure 10. Determination of the subsample size for building the vocabulary of the chemical topic model. Data set: ChEMBL 22. Fragment method: Morgan. (blue) Number of unique fragments generated depending on the subsample ratio of the data set. (green) Size of the final vocabulary after filtering the fragments (threshold rare fragments 0.1%, threshold common fragments 10%). Note that the left y-axis uses a log scale. (red) Percentage of overlapping vocabulary fragments compared to using a 100% subsample size to build the vocabulary. All results represent the mean value of five randomly drawn subsets; the standard deviation is shown as a shaded area (see the beginning of the red line).
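The subsampling strategy can be summarized in a few lines; `fragment_fn` stands for any of the fragment generators described in the Methods section and is an assumed helper, not part of a published API.

```python
# Sketch: build the vocabulary from a random 10% sample of the molecules.
import random
from collections import Counter

def build_vocabulary(mols, fragment_fn, sample_ratio=0.10,
                     min_df=0.001, max_df=0.10, seed=42):
    sample = random.Random(seed).sample(mols, int(len(mols) * sample_ratio))
    frag_counts = [fragment_fn(mol) for mol in sample]  # one Counter per molecule
    n = len(frag_counts)
    doc_freq = Counter()
    for counts in frag_counts:
        doc_freq.update(counts.keys())
    # apply the same rare/common thresholds as for the full data set
    return sorted(f for f, c in doc_freq.items() if min_df * n <= c <= max_df * n)
```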

To evaluate the model-building runtime we generated topic models using between 100 and 500 topics. The runtime was measured for the fitting process of the model and for transforming the data with the model. To handle the large amount of data we used the online learning method of the LDA implementation and ran it with data chunks of size 5000. Additionally, for these larger models we reduced the maximum number of iterations for the optimization to 10 (for the smaller models shown before, max_iter was set to 100). Figure 11 shows the result of this experiment. The runtime for building a chemical topic model appears to increase linearly with the number of topics (see ref 2 for the time complexity of the LDA method). Fitting a 100-topic model on the whole ChEMBL data set took about 1 h on one core of an Intel Xeon 64-bit 3.6 GHz eight-core processor. As expected, the runtime for transforming the data is much lower: only about 5 min. An important aspect to consider is also the memory needed to handle these huge matrices. Here, an alternative would be to train the model on a smaller subset of the data (20%) and only apply the model to the rest of the data (experiments on this are planned for an upcoming publication).

Figure 11. Runtimes for building chemical topic models on the ChEMBL 22 data set. (blue) Runtime of the fitting process of the model building. (green) Runtime for transforming the data with the model. Note that the transforming runtime of the 500-topic model is not available due to memory limitations (16 GB RAM).

Finally, we investigated the 100-topic model of the ChEMBL 22 data set. Some of these topics are very unspecific, as their major themes are rather small and common substructures like chlorophenyl or pyridine. Other topics, in contrast, could be labeled, for example, "small proteins", "cyclic peptides", "steroids", "morphines", or "DNA" (see Figure 12). In Figure 12 the results of the 100-topic model of ChEMBL 22 are summarized. We looked into some of these topics in detail and found that for topics comprising larger molecules, like DNA, proteins, or cyclic peptides, more fragments were indeed associated with the topics (darker blue bars in Figure 12, top). Looking at the top eight fragments of, for example, the cyclic peptide or the steroid topic, an experienced chemist could most likely predict which molecules will be associated with these topics. So even for this huge and heterogeneous data set the chemical topic model provides an interesting starting point for further investigation.

Figure 12. 100-topic model of the ChEMBL 22 data set. (top) Mean topic profile of the ChEMBL 22 molecules. Easily interpretable topics are marked with blue arrows. The bars are colored by the number of fragments important for the topic (the darker the blue, the more fragments are associated with the topic). (bottom) Top eight fragments of four different topics: cyclic peptides, small proteins, DNA/RNA, and steroids. Chemical topic model: 100 topics, Morgan FP fragmentation, filtered matrix (threshold rare fragments 0.1%, threshold common fragments 10%), LDA seed 42, subsampling size 10% of the ChEMBL 22 data set.



DISCUSSION

Here, we have introduced chemical topic modeling as a novel method to organize sets of molecules. The method is adopted from the text-mining field, where it has been applied very successfully in many different areas (see the Introduction). In this first publication our goal was to investigate how topic modeling can be applied to chemical structures and whether or not the results are useful. Since the method is an unsupervised approach, a quantitative evaluation of its performance is challenging. Our approach was to measure how well the topic model is able to reproduce human-assigned concepts or groupings, in our case chemical series. In most of the data sets evaluated, the method achieved very good results in retrieving chemical series from a set of molecules. In some cases the chemical topic model was not able to reproduce the human concepts but instead proposed an alternative organization of the molecules. This can highlight subtleties that may not be found using the original grouping of the molecules and might provide an opportunity to change our perspective on the related molecules.

As mentioned above, transferring topic modeling from the text-mining field to chemistry is not without challenges. The fragmentation of texts and molecules is fundamentally different: for molecules, many obvious approaches lead to "overlapping" fragments and thereby to a strong correlation of many of the fragments with each other. This is probably also true for some words in texts (e.g., phrases) but to a lesser extent. In the text-mining field, Wallach found that the derived topics are more reasonable when using n-grams of words instead of unigrams.38 Our molecular fragments are not n-grams in the strict sense, since they are not necessarily linear, but it can be argued that they share the beneficial property of being overlapping in many cases. This might contribute to the observation that the n-gram-analogous fragments resulting from the Morgan and RDKit fingerprints seem to work better than the BRICS fragments, which are essentially unigrams. Furthermore, the molecule−fragment matrix is relatively sparse, so filtering out rare and common fragments leads to a more stable and efficient building of the topic model (as shown above) but in some cases results in molecules without features. Consequently, appropriate thresholds need to be found for each data set. Another important difference is the granularity of the topics: in the text-mining field the approach is usually used to organize large sets of documents into an overall thematic structure, like texts about politics, art, or economy. In chemistry we are typically more interested in a finer-grained organization of the molecules, as in the application we showed here: grouping chemical series. For large data sets like ChEMBL we found some more general topics like "proteins", "DNA", or "steroids". These general topics invite further investigation since they are sensible and humanly understandable. Obtaining a finer-grained organization of such large data sets by using many more topics would, however, induce a much higher model complexity, decreasing the feasibility of building such a model. Here, an interesting idea could be the creation of a hierarchical topic model, as has also been built and applied in text mining.5 This way, more detailed topics might be found and a kind of ontology for the molecule set derived. In this regard, we plan to apply topic modeling to the ChEBI database,39 which is already mapped onto an ontology, and to see whether the topic model can recapitulate this ontology.

DOI: 10.1021/acs.jcim.7b00249 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 11. Runtimes building chemical topic models on the ChEMBL 22 data set. (blue) Runtime fitting process of the model building. (green) Runtime transforming of the data to the model. Please note that the transforming runtime of the 500-topic model is not available due to memory limitations (16 GB RAM).

“overlapping” fragments and thereby to a strong correlation of many of the fragments with each other. This is probably also true for some words in texts (e.g., phrases) but to a lesser extent. In the text-mining field Wallach found that the derived topics are more reasonable when using n-grams of words instead of unigrams.38 Our molecular fragments are not n-grams in the strict sense since they are not necessarily linear, but it can be argued that they share the beneficial property of being overlapping in many cases. This might contribute to the observation that the n-gram-analogous fragments resulting from the Morgan and RDKit fingerprints do seem to work better than the BRICS fragments, which are essentially unigrams. Furthermore, the molecule−fragment matrix is relatively sparse so that filtering out rare and common fragments leads to a more stable and efficient building of the topic model (as shown above) but in some cases results in molecules that have no features. Consequently, appropriate thresholds need to be found for each data set. Another important difference is the granularity of the topics: in the text-mining field the approach is usually used to organize large sets of documents into an overall thematic structure like texts about politics, art, or economy. In chemistry we are typically more interested in a finer grained organization of the molecules, like the application we showed here: grouping chemical series. For large data sets like ChEMBL we found some more general topics like “proteins”, “DNA”, or “steroids”. These general topics invite further investigation since they are sensible and humanly understandable. Obtaining a finer grained organization of those large data sets using a lot more topics in the topic model would induce a much higher complexity of the model, decreasing the feasibility of building such a model. Here, an interesting idea could be the creation of a hierarchical topic model which also has been built and applied in text-mining.5 This way more detailed topics might be found and a kind of ontology for the molecule set is derived. In this regard, we plan to apply the topic modeling to the ChEBI database39 which already is mapped into an ontology and see if the topic model can recapitulate this ontology. Despite these challenges, which need to be investigated in future studies, the new model offers a lot of interesting advantages. In many of the examples above the chemical topic model allows intuitive visualization of the topics mapped

Despite these challenges, which need to be investigated in future studies, the new model offers a number of interesting advantages. In many of the examples above, the chemical topic model allows intuitive visualization of the topics mapped directly onto the molecules. Furthermore, investigating the top fragments of the topics of a model enables quick identification of “interesting” topics for further analysis. This is a distinct advantage of chemical topic modeling over the usual black-box clustering approaches applied to organize molecules in chemistry. Being able to comprehend a model is a huge benefit for researchers, and the ability of the chemical topic model to reproduce human-assigned concepts makes it a great tool for exploring sets of molecules. The simple graphical user interface we provide along with this publication (see GitHub [www.github.com/rdkit/CheTo]) is a first demonstration of interactively visualizing a chemical topic model. More sophisticated implementations could produce a valuable tool for scientists to learn about the hidden structure in molecule sets and to discover alternative relations between molecules.

Another advantage of this novel method is that the topic model belongs to the class of mixed-membership models and thereby enables fuzzy clustering. Although we have not explored it in this first study of chemical topic modeling, this property, the topic probability vector of each molecule, can be used as a new descriptor space in similarity searching or in machine learning. Finally, the topic model, or LDA, describes a generative process that could in principle be used to generate novel molecules: given a chemical topic model built on different series of molecules active against a certain target, the model could be used to create novel molecules by combining fragments according to the topics. However, as with texts, this is not straightforward, since the model knows neither the grammar nor the reaction rules necessary to produce a feasible output.

There are several applications in which chemical topic modeling could help to organize and analyze chemical data. One obvious example is organizing HTS results to identify interesting series and scaffolds for hit selection. Especially with ever-increasing cherry-picking capabilities, initial screens tend to be smaller, and a method to reliably select interesting compounds for expansion screens is becoming more important.
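As a sketch of the fuzzy-membership descriptor idea mentioned above: assuming `lda` is a fitted scikit-learn LatentDirichletAllocation model and `X` the molecule−fragment count matrix (both assumptions), the per-molecule topic distribution can be used directly for similarity searching.

```python
# Sketch: topic probability vectors as a descriptor space for similarity search.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

topic_vecs = lda.transform(X)              # shape: (n_molecules, n_topics)

query = topic_vecs[0:1]                    # use the first molecule as the query
similarities = cosine_similarity(query, topic_vecs)[0]
neighbours = np.argsort(similarities)[::-1][1:6]  # top 5, excluding the query
print('Most similar molecules by topic profile:', neighbours)
```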


Figure 12. 100-topic model of the ChEMBL 22 data set. (top) Mean topic profile of the ChEMBL 22 molecules. Easily interpretable topics are marked with blue arrows. The bars are colored by the number of fragments important for the topic (the darker the blue, the more fragments are associated with the topic). (bottom) Top eight fragments of four different topics: cyclic peptides, small proteins, DNA/RNA, and steroids. Chemical topic model: 100 topics. Morgan FP fragmentation: filtered matrix (threshold rare fragments 0.1%, threshold common fragments 10%). LDA seed: 42. Subsampling size: 10% of the ChEMBL 22 data set.
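A minimal sketch of how the top fragments per topic, as shown in the bottom panel of Figure 12, can be read off a fitted model: `lda` (a fitted scikit-learn LatentDirichletAllocation) and `vocabulary` (a mapping from column index to fragment SMILES) are assumed to exist and are illustrative names.

```python
# Sketch: extract the highest-weighted fragments of a topic (cf. Figure 12).
import numpy as np

def top_fragments(lda, vocabulary, topic_id, n=8):
    """Return (fragment, weight) pairs for the n top fragments of one topic."""
    weights = lda.components_[topic_id]      # fragment weights of this topic
    top_idx = np.argsort(weights)[::-1][:n]
    return [(vocabulary[i], weights[i]) for i in top_idx]

for fragment, weight in top_fragments(lda, vocabulary, topic_id=0):
    print(f'{fragment}\t{weight:.1f}')
```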

Beyond HTS analysis, another idea could be the derivation of ontologies for large molecule sets or for reaction types. Furthermore, we can assign labels to the topics based on properties of the molecules found in a certain topic; using solubility, for example, “soluble” and “insoluble” topics can be discovered and analyzed to find substructures that may lead to this property. Most molecular properties are induced not by a single substructure but by the combination or co-occurrence of several; such combinations might be highlighted by the chemical topic model.

In conclusion, we have shown in this publication that chemical topic modeling is a promising new approach, with many challenges and opportunities left to study, in the era of big data and pattern recognition.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00249.

Further plots and details discussed in this study (PDF)
Jupyter notebooks to evaluate the new method described in this study (ZIP)
Data sets A and B described in this study (ZIP)

AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected].

ORCID
Nadine Schneider: 0000-0001-5824-2764
Gregory A. Landrum: 0000-0001-6279-4481

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS

The authors thank Finton Sirockin and Bernhard Rohde for valuable and critical discussions. The authors thank Anna Pelliccioli and Brian Kelley for critical proofreading of the manuscript. N. Schneider thanks the NIBR Postdoc Program for a Postdoctoral Fellowship.

ABBREVIATIONS

FP, fingerprint; LDA, latent Dirichlet allocation

REFERENCES

(1) Hofmann, T. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Mach. Learn. 2001, 42, 177−196.
(2) Blei, D. M.; Ng, A. Y.; Jordan, M. I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993−1022.
(3) Blei, D. M. Probabilistic Topic Models. Commun. ACM 2012, 55, 77−84.
(4) Pritchard, J.; Stephens, M.; Donnelly, P. Inference of Population Structure Using Multilocus Genotype Data. Genetics 2000, 155, 945−959.
(5) Teh, Y. W.; Jordan, M. I.; Beal, M. J.; Blei, D. M. Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes. Adv. Neural Inf. Process Syst. 2004, 1385−1392.
(6) Bart, E.; Welling, M.; Perona, P. Unsupervised Organization of Image Collections: Taxonomies and Beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2302−2315.
(7) Zhao, W.; Zou, W.; Chen, J. J. Topic Modeling for Cluster Analysis of Large Biological and Medical Datasets. BMC Bioinf. 2014, 15, S11.
(8) Hoffman, M. D.; Blei, D. M.; Wang, C.; Paisley, J. W. Stochastic Variational Inference. J. Mach. Learn. Res. 2013, 14, 1303−1347.
(9) Zhao, W.; Chen, J. J.; Perkins, R.; Liu, Z.; Ge, W.; Ding, Y.; Zou, W. A Heuristic Approach to Determine an Appropriate Number of Topics in Topic Modeling. BMC Bioinf. 2015, 16, S8.
(10) Wang, V.; Xi, L.; Enayetallah, A.; Fauman, E.; Ziemek, D. GeneTopics - Interpretation of Gene Sets via Literature-driven Topic Models. BMC Syst. Biol. 2013, 7, S10.
(11) Liu, L.; Tang, L.; Dong, W.; Yao, S.; Zhou, W. An Overview of Topic Modeling and its Current Applications in Bioinformatics. SpringerPlus 2016, 5, 1608.
(12) Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: A Free Tool to Discover Chemistry for Biology. J. Chem. Inf. Model. 2012, 52, 1757−1768.
(13) Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B. A.; et al. PubChem Substance and Compound Databases. Nucleic Acids Res. 2016, 44, D1202−D1213.
(14) Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L. J.; Chambers, J.; Davies, M.; Krüger, F. A.; Light, Y.; Mak, L.; McGlinchey, S.; et al. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Res. 2014, 42, D1083−D1090.
(15) Clark, M. A.; Acharya, R. A.; Arico-Muendel, C. C.; Belyanskaya, S. L.; Benjamin, D. R.; Carlson, N. R.; Centrella, P. A.; Chiu, C. H.; Creaser, S. P.; Cuozzo, J. W.; et al. Design, Synthesis and Selection of DNA-encoded Small-molecule Libraries. Nat. Chem. Biol. 2009, 5, 647−654.
(16) Hull, R. D.; Singh, S. B.; Nachbar, R. B.; Sheridan, R. P.; Kearsley, S. K.; Fluder, E. M. Latent Semantic Structure Indexing (LaSSI) for Defining Chemical Similarity. J. Med. Chem. 2001, 44, 1177−1184.
(17) Deerwester, S.; Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; Harshman, R. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391.
(18) Hull, R. D.; Fluder, E. M.; Singh, S. B.; Nachbar, R. B.; Kearsley, S. K.; Sheridan, R. P. Chemical Similarity Searches Using Latent Semantic Structural Indexing (LaSSI) and Comparison to TOPOSIM. J. Med. Chem. 2001, 44, 1185−1191.
(19) Singh, S. B.; Sheridan, R. P.; Fluder, E. M.; Hull, R. D. Mining the Chemical Quarry with Joint Chemical Probes: An Application of Latent Semantic Structure Indexing (LaSSI) and TOPOSIM (Dice) to Chemical Database Mining. J. Med. Chem. 2001, 44, 1564−1575.
(20) Chang, J.; Boyd-Graber, J. L.; Gerrish, S.; Wang, C.; Blei, D. M. Reading Tea Leaves: How Humans Interpret Topic Models. Adv. Neural Inf. Process Syst. 2009, 31, 1−9.
(21) Hoffman, M.; Bach, F. R.; Blei, D. M. Online Learning for Latent Dirichlet Allocation. Adv. Neural Inf. Process Syst. 2010, 856−864.
(22) Wallach, H. M.; Murray, I.; Salakhutdinov, R.; Mimno, D. Evaluation Methods for Topic Models. Proc. 26th Annu. Int. Conf. Machine Learning 2009, 1105−1112.
(23) Agrawal, A.; Fu, W.; Menzies, T. What is Wrong with Topic Modeling? (and How to Fix it Using Search-based SE). 2016; http://arxiv.org/abs/1608.08176 (accessed March 10, 2017).
(24) Riniker, S.; Fechner, N.; Landrum, G. A. Heterogeneous Classifier Fusion for Ligand-based Virtual Screening: or, How Decision Making by Committee can be a Good Thing. J. Chem. Inf. Model. 2013, 53, 2829−2836.
(25) Riniker, S.; Landrum, G. A. Open-source Platform to Benchmark Fingerprints for Ligand-based Virtual Screening. J. Cheminf. 2013, 5, 26.
(26) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: a Large-scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40, D1100−D1107.
(27) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825−2830.
(28) Degen, J.; Wegscheid-Gerlach, C.; Zaliani, A.; Rarey, M. On the Art of Compiling and Using 'Drug-Like' Chemical Fragment Spaces. ChemMedChem 2008, 3, 1503−1507.
(29) Rogers, D.; Hahn, M. Extended-connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.
(30) Landrum, G. A. RDKit: Open-Source Cheminformatics Software, version 2016.03; DOI: 10.5281/zenodo.58441; http://www.rdkit.org and https://github.com/rdkit/rdkit (accessed July 17, 2016).
(31) Petterson, J.; Buntine, W.; Narayanamurthy, S. M.; Caetano, T. S.; Smola, A. J. Word Features for Latent Dirichlet Allocation. Adv. Neural Inf. Process Syst. 2010, 1921−1929.
(32) Grant, J. A.; Haigh, J. A.; Pickup, B. T.; Nicholls, A.; Sayle, R. A. Lingos, Finite State Machines, and Fast Similarity Searching. J. Chem. Inf. Model. 2006, 46, 1912−1918.
(33) Zachariae, U.; Giordanetto, F.; Leach, A. G. Side Chain Flexibilities in the Human Ether-a-go-go Related Gene Potassium Channel (hERG) Together with Matched-pair Binding Studies Suggest a New Binding Mode for Channel Blockers. J. Med. Chem. 2009, 52, 4266−4276.
(34) Garaj, V.; Puccetti, L.; Fasolis, G.; Winum, J. Y.; Montero, J. L.; Scozzafava, A.; Vullo, D.; Innocenti, A.; Supuran, C. T. Carbonic Anhydrase Inhibitors: Novel Sulfonamides Incorporating 1,3,5-Triazine Moieties as Inhibitors of the Cytosolic and Tumor-associated Carbonic Anhydrase Isozymes I, II and IX. Bioorg. Med. Chem. Lett. 2005, 15, 3102−3108.
(35) Carta, F.; Garaj, V.; Maresca, A.; Wagner, J.; Avvaru, B. S.; Robbins, A. H.; Scozzafava, A.; McKenna, R.; Supuran, C. T. Sulfonamides Incorporating 1,3,5-Triazine Moieties Selectively and Potently Inhibit Carbonic Anhydrase Transmembrane Isoforms IX, XII and XIV over Cytosolic Isoforms I and II: Solution and X-ray Crystallographic Studies. Bioorg. Med. Chem. 2011, 19, 3105−3119.
(36) Winum, J. Y.; Pastorekova, S.; Jakubickova, L.; Montero, J. L.; Scozzafava, A.; Pastorek, J.; Vullo, D.; Innocenti, A.; Supuran, C. T. Carbonic Anhydrase Inhibitors: Synthesis and Inhibition of Cytosolic/Tumor-associated Carbonic Anhydrase Isozymes I, II, and IX with Bis-sulfamates. Bioorg. Med. Chem. Lett. 2005, 15, 579−584.
(37) Özensoy, Ö.; Puccetti, L.; Fasolis, G.; Arslan, O.; Scozzafava, A.; Supuran, C. T. Carbonic Anhydrase Inhibitors: Inhibition of the Tumor-associated Isozymes IX and XII with a Library of Aromatic and Heteroaromatic Sulfonamides. Bioorg. Med. Chem. Lett. 2005, 15, 4862−4866.
(38) Wallach, H. M. Topic Modeling: Beyond Bag-of-words. Proc. 23rd Int. Conf. Machine Learning 2006, 977−984.
(39) Hastings, J.; de Matos, P.; Dekker, A.; Ennis, M.; Harsha, B.; Kale, N.; Muthukrishnan, V.; Owen, G.; Turner, S.; Williams, M.; et al. The ChEBI Reference Database and Ontology for Biologically Relevant Chemistry: Enhancements for 2013. Nucleic Acids Res. 2013, 41, D456−D463.