
Article

Chemical topic modeling: Exploring molecular datasets using a common text-mining approach
Nadine Schneider, Nikolas Fechner, Gregory A. Landrum, and Nikolaus Stiefl
J. Chem. Inf. Model., Just Accepted Manuscript. DOI: 10.1021/acs.jcim.7b00249. Publication Date (Web): July 17, 2017. Downloaded from http://pubs.acs.org on July 18, 2017.




Chemical Topic Modeling: Exploring Molecular Datasets Using a Common Text-Mining Approach

Nadine Schneider1*, Nikolas Fechner1, Gregory A. Landrum2, Nikolaus Stiefl1

1 Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
2 KNIME.com AG, Technoparkstr 1, 8005 Zurich, Switzerland

* E-mail: [email protected]



KEYWORDS Latent Dirichlet Allocation, RDKit, Chemical Fingerprints, BRICS, ChEMBL, Clustering

ABSTRACT

Big data is one of the key transformative factors increasingly influencing all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is no different: more and more data are being generated, for instance by technologies such as DNA-encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular datasets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and the relationships between those topics to be investigated. In this first study, we thoroughly evaluate the method in different experiments and discuss both its advantages and disadvantages. We show very promising results in reproducing human-assigned concepts by using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a major benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 dataset to test its robustness and efficiency. In about one hour we built a 100-topic model of this large dataset, in which we could identify interesting topics like "proteins", "DNA", or "steroids".
Along with this publication we provide our datasets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.


INTRODUCTION

The amount of data we produce every day is far beyond human processing capacity. Additionally, most of these data are unstructured and unlabeled, making it very difficult to gain insights from them. Topic models,1,2 originally developed in the text-mining field, can help to uncover the hidden structure in a dataset and allow us to detect connections within the data that might otherwise be overlooked. Initially applied mainly to texts, topic modeling is today used successfully in a wide range of disciplines, from text mining and the social sciences to computer vision and biology.3-7

Topic modeling is a probabilistic framework originally developed to extract the hidden thematic structure of a collection of text documents. For example, models have been built on large text corpora like 1.8 million articles from The New York Times or 300,000 publications from Nature.8 Here, the extracted topics can be subsumed, for example, as “Music”, “Literature”, or “Art” for The New York Times and as “Genetics”, “Neural science”, or “Astronomy” for Nature. The extracted thematic structure is quite compatible with the concepts humans would use to organize the texts. This is a major advantage compared to other methods like clustering which can be difficult to interpret. The simplest description of topic models is that they detect words that tend to co-occur across a large number of documents. These words then represent a topic. So, the result of a topic model is not a list of “named” topics but a set of words for each topic. The model assigns probabilities to the words, but the meaning of the topics is assigned by humans. Furthermore, one basic idea in topic modeling is that documents do not exhibit only one certain topic but comprise a mixture of topics. This mixture or distribution of topics is also provided by the model for each document so that we can search for similar documents based on their thematic compositions.

This way of modeling data allows us to explore datasets in a novel, very intuitive way and has been successfully employed in many areas beyond text documents. For example, in systems biology and bioinformatics it is used for exploring microarray data and medical datasets or for interpreting gene sets.7,9,10 Zhao and coworkers found that topic models could retrieve the correct grouping of bacteria into serotypes (~41,000 samples and 20 different serotypes) and achieve more accurate results than conventional clustering methods in dividing cancer data into subtypes (111 samples and 2 subtypes).7 Other examples of topic models in the context of bioinformatics were recently reviewed by Liu et al.11 These and other studies show that topic modeling can be transferred to data types other than text and that it is a promising alternative to traditional data-analysis methods like clustering.

Due to the massive number of chemical structures found in databases like ZINC,12 PubChem,13 and ChEMBL,14 and due to novel technologies like DNA-encoded libraries15 that generate billions of novel molecules, alternative approaches to organize and explore these data are highly sought after. Clustering approaches like K-means are often applied to group compounds into smaller sets. These algorithms usually rely on a similarity measure to divide the molecules into clusters. Defining a similarity between compounds is often not straightforward and, in addition, the resulting clusters are often hard to interpret. Although topic modeling does not require a similarity measure and yields interpretable results, we do need an algorithm to transform the molecules so that they are compatible with the approach. One such transformation of molecular data to make it suitable for use in a text-mining algorithm was proposed by Hull and colleagues16 when they developed Latent Semantic Structure Indexing (LaSSI), which extends Latent Semantic Indexing (LSI)17 to molecular data. In contrast to topic modeling, LSI is based on a singular value decomposition of the input matrix and is not a probabilistic model. The resulting model is less flexible than a topic model and leads to a linear subspace that captures most of the variance in the data. In LaSSI, a descriptor-molecule matrix equivalent to the word-document matrix of texts is created from the compound set; molecular fingerprints with counts were used as the descriptors in that study. LaSSI has been successfully applied in virtual screening and was shown to be superior to common similarity searches18,19 in some cases.


In this publication, we present the first chemistry-related implementation of topic modeling for organizing large molecule sets into "chemical topics". A chemical topic can be seen as a pattern of co-occurring fragments that recurs across a set of molecules. The focus of this study is on evaluating the performance, interpretability, and robustness of the new method. As with all unsupervised learning methods, it is difficult to quantitatively measure the performance of a topic model per se. To mitigate this limitation, we used labeled data in our experiments and quantified how well the model could reconstruct series of chemical compounds from a set of molecules. This experiment also allows us to assess the interpretability of a chemical topic model: does the model recover human concepts (the chemical series) from the data? Similar evaluation methods are used in the field of text mining, where experiments like word intrusion into a topic or topic intrusion into a document are conducted to quantitatively evaluate how well the inferred topics match human concepts.20 The evaluation of the quality of topic models is an active field of research in which additional aspects like the applicability to novel data21,22 and the stability of a topic model23 have been investigated. Since one of the major advantages of topic modeling is the interpretability of the models, we were also interested in the robustness and stability of a chemical topic model in order to assess how reliable the approach is for gaining reproducible insights.

The paper is organized as follows: in the first section we present our datasets, more background on the topic modeling approach, and a detailed description of our implementation. Next, we evaluate several aspects of the new method in a series of experiments. Finally, we discuss the results and give an outlook on future perspectives and applications of this novel approach.

METHODS AND MATERIALS In this section we first introduce the datasets we used and constructed for this study. In the second part the implementation of the chemical topic model is described. Finally, the evaluation methods we used in our study are presented.


Dataset. In our study we used dataset II that Riniker et al. published in their fingerprint-based virtual-screening study.24,25 This dataset was extracted from the ChEMBL database,14,26 which mainly contains compounds extracted from the scientific literature. A typical medicinal-chemistry paper usually includes data on one or two chemical series and sometimes reference compounds which might be structurally different. These chemical series are the "chemical topics" that we try to retrieve. Riniker and co-workers constructed the dataset as follows: from publications on 50 protein targets that had been identified as difficult for virtual screening, they selected only those containing at least ten active compounds. In addition, targets with fewer than four publications were discarded. Their final dataset consists of 37 targets with 4-37 papers and 10-112 active compounds per paper. Here we further constrained this dataset to publications with at least 20 active compounds to obtain larger and more meaningful chemical series. This results in 36 targets with 1-26 papers each. We constructed two datasets out of these papers: dataset A consists of 36 different papers, one randomly chosen for each target. Six compounds were identified as non-unique, appearing multiple times for different isoforms of carbonic anhydrase. Those ambiguous compounds were omitted, resulting in 880 different compounds (see Table S1 for detailed information). A second dataset (dataset B) was constructed by selecting five pharmaceutically relevant targets from the 36 targets of dataset A, covering different target classes. For those we extracted all publications from the ChEMBL 22 database. All compounds with a measured Ki, IC50, EC50, or AC50 were included. Molecules that occur in more than one publication and compounds with more than one activity value assigned were removed from the dataset to avoid ambiguities.
Papers with fewer than 15 different compounds were also excluded from the final dataset. Table 1 summarizes the final dataset.

Table 1. Summary of dataset B.

Target                      ChEMBL target ID   # papers   # molecules
Carbonic anhydrase II       15                 38         887
Dopamine D2 receptor        72                 47         1856
MAP kinase p38 alpha        10188              31         959
Dipeptidyl peptidase IV     11140              43         1330
Cathepsin S                 11534              27         855

Both datasets (A and B) are given as CSV files in the Supplementary Information. Finally, we employed the whole ChEMBL 22 dataset as a large-scale dataset to test the scalability of our new method. This dataset consists of about 1.6 million unique compounds that were found for about 11,000 different targets and extracted from more than 65,000 papers. The dataset was used as is; we did not preprocess the molecules since we only wanted to assess the efficiency, robustness, and runtime of the novel method.

Topic modeling. One of the most widely applied and simplest algorithms for topic modeling is Latent Dirichlet Allocation (LDA), which was developed by Blei and coworkers.2 This algorithm is also the basis of our chemical topic model, and we briefly summarize it here; more background and algorithmic details can be found in the original publications.2,3,8,21 LDA is a Bayesian probabilistic model. It models a generative process in which documents arise from an imaginary random process: each document is produced from a set of topics, and these topics are distributions over a fixed vocabulary. In a first step, a distribution over topics is randomly chosen for a document. Then, from those topics, words are selected by sampling from the respective word distributions. An important part of the concept is that LDA is a mixed-membership model: documents exhibit different topics in different proportions. In reality, only the documents, i.e. the distributions over words, are observed; the thematic structure of the documents, as implied by the topic distribution, is hidden. LDA provides a tool to infer this hidden thematic structure by estimating a likely posterior distribution of the hidden variables. Those hidden variables, the hidden topical structure, can be described more formally: for a corpus of multiple documents D, K different topics are assumed. Each topic k represents a multinomial distribution βk over a fixed vocabulary and is drawn from a Dirichlet distribution. Each document d exhibits a topic distribution θd, which is also drawn from a Dirichlet distribution. For each word n in a document d there exists a topic assignment zd,n. Finally, given the observed variables, the words wd for each document d, the posterior probability can be calculated as:

p(β_{1:K}, θ_{1:D}, z_{1:D} | w_{1:D}) = p(β_{1:K}, θ_{1:D}, z_{1:D}, w_{1:D}) / p(w_{1:D})    (see reference 3)

Here, the numerator is the joint probability of the random variables and the denominator is the marginal probability of the observed variables, i.e. the probability of obtaining this corpus under any topic model. Since the latter is intractable to compute exactly, different methods have been developed to approximate it efficiently (see for example references 2, 3). We use the scikit-learn27 LDA implementation, which is based on an online variational Bayes algorithm.8,21

We adapted the following parameters for our chemical topic models: the number of topics, the learning method (we use 'batch' as default), the maximum number of iterations for the optimization to converge (increased to 100), and the random state (to obtain a reproducible model). The resulting model contains the distribution of topics over documents and the distribution of words over topics. We normalize both matrices by the sum of each row so that we obtain a probability distribution of topics for each document and a probability distribution of words for each topic. This allows us to investigate the most probable words for a topic, interpret the meaning of a topic, and finally assign names or labels to the topics. In the following we describe how we adapted the topic modeling approach to chemical data, thereby enabling the organization of collections of chemical compounds into chemical topics.
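As a sketch of how such a model can be built with scikit-learn's LatentDirichletAllocation using the parameters named above; the toy count matrix here is our own illustration, not the paper's data:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy molecule-fragment count matrix: rows = molecules ("documents"),
# columns = fragment counts ("words"). Values are illustrative only.
rng = np.random.RandomState(42)
X = rng.randint(0, 3, size=(20, 50))

# Parameters as described in the text: 'batch' learning, up to 100
# iterations, fixed random state for reproducibility. n_components=5
# (the number of topics) is arbitrary for this toy example.
lda = LatentDirichletAllocation(n_components=5, learning_method="batch",
                                max_iter=100, random_state=0)
doc_topic = lda.fit_transform(X)   # molecule-topic matrix
topic_word = lda.components_       # topic-fragment matrix

# Row-normalize both matrices, as described in the text, to obtain a
# topic distribution per molecule and a fragment distribution per topic.
doc_topic = doc_topic / doc_topic.sum(axis=1, keepdims=True)
topic_word = topic_word / topic_word.sum(axis=1, keepdims=True)
```

The most probable fragments of a topic can then be read off by sorting a row of `topic_word`.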


Figure 1. Chemical topic modeling workflow. The terms used in topic modeling of text documents are shown in gray to make the connection to chemical topic modeling.

Chemical topic modeling. In our chemical topic modeling approach we define molecules as documents and the substructures or fragments derived from these molecules as words (see Figure 1). These fragments can be generated using different fragmentation approaches such as chemical fingerprints, rule-based bond-cutting methods like BRICS,28 or pre-defined substructures. A detailed description of the fragment generation can be found below. After generating the fragments, a matrix is constructed in which the rows represent the molecules and the columns contain the counts of each fragment occurring in the respective molecule. Finally, this matrix can be filtered by excluding the most common and the rare fragments. Common fragments are defined as fragments occurring in more than 10% of the molecules; rare fragments are those found in less than 1% of the compounds (0.1% for the larger datasets). We show results using both unfiltered and filtered molecule-fragment matrices. This matrix, the corpus of our molecule dataset, is used as input for the LDA algorithm. The final LDA model returns two matrices: the topic-fragment matrix and the molecule-topic matrix. Based on these matrices, the topics along with the most probable fragments per topic can be retrieved and visualized. For each molecule, a topic profile (the probabilities of the molecule being associated with each topic) can be extracted or, alternatively, the most likely topic can be assigned to the molecule. In the following, the fragment generation, fragment filtering, and visualization of topics are explained in more detail.
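The construction of the molecule-fragment count matrix described above can be sketched as follows; the function and variable names are ours, for illustration only:

```python
from collections import Counter

def build_count_matrix(mol_fragments):
    """Build a molecule-fragment count matrix from lists of fragment IDs.

    mol_fragments: one list of fragment identifiers (e.g. hashed Morgan
    environments or BRICS SMILES) per molecule.
    Returns (matrix, vocabulary), where matrix[i][j] counts how often
    fragment j occurs in molecule i.
    """
    # The vocabulary is the union of all fragments, in a fixed order.
    vocab = sorted({f for frags in mol_fragments for f in frags})
    matrix = []
    for frags in mol_fragments:
        counts = Counter(frags)
        matrix.append([counts.get(f, 0) for f in vocab])
    return matrix, vocab
```

For example, `build_count_matrix([["a", "b", "a"], ["b", "c"]])` yields the vocabulary `["a", "b", "c"]` and the matrix `[[2, 1, 0], [0, 1, 1]]`.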

Figure 2. Fingerprint-based fragments. Left: Generation of Morgan FP fragments with radius two. Right: Generation of RDKit FP fragments with a path length between three and five bonds. Linear and branched paths are possible. Gray circles: atoms in aliphatic rings. Yellow circles: atoms in aromatic rings. Gray lines: bonds to atoms which are not part of the fragment. In the Morgan FP those atoms are implicitly included in the invariant; in the RDKit FP they are ignored. Dotted lines: aromatic bonds.

Fragment generation: We used three different ways to fragment the molecules for the LDA: circular Morgan fingerprints,29 path-based RDKit fingerprints,30 and BRICS fragments.28 This allows us to investigate the influence of fragment size and shape (smaller, overlapping fragments from the fingerprints vs. larger fragments from the BRICS approach). For the Morgan fingerprints we chose a radius of two bonds and excluded all fingerprint bits associated with a radius of less than two (compare Figure 2, left). Additionally, we adapted the standard Morgan atom invariant to make it less specific: our invariant does not consider formal charge or isotopes but does include information about the aromaticity of an atom. For the RDKit fingerprint, the minimum path length (number of bonds) was set to three and the maximum to five (Figure 2, right). Here the usual invariant is used, which only considers atomic number, atom degree, and aromaticity of the atoms. Using the BRICS rules, which encode ways of breaking molecules into retrosynthetically interesting chemical substructures,28 we obtained a different set of fragments. These are in general larger and, more importantly, not overlapping, unlike the fragments generated using the fingerprint methods. Figure 3 shows fragments obtained with the three different methods to highlight how different they are.

Figure 3. Exemplary fragments derived from random molecules using the three different fragment approaches. Gray circles: aliphatic ring atoms. Yellow circles: aromatic ring atoms. Gray lines: neighboring atoms not directly considered for the fragment. Dotted lines: aromatic bonds.
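As a rough sketch of how such fragments can be generated with RDKit (default Morgan invariants shown; the paper's customized invariant, which ignores formal charges and isotopes, would require additional code; the example molecule is arbitrary):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, BRICS

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")  # paracetamol, example only

# Morgan environments as a count vector; bitInfo records the
# (atom index, radius) pairs per bit, so bits with radius < 2 can be
# excluded, as described in the text.
info = {}
fp = AllChem.GetMorganFingerprint(mol, 2, bitInfo=info)
counts = {bit: sum(1 for _, r in envs if r == 2)
          for bit, envs in info.items() if any(r == 2 for _, r in envs)}

# BRICS fragments: larger, non-overlapping substructures (as SMILES).
brics_frags = sorted(BRICS.BRICSDecompose(mol))
```

The keys of `counts` are the unhashed fragment identifiers that serve as "words"; no folding to a fixed fingerprint size is performed, avoiding bit collisions.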

To avoid bit collisions caused by hashing, the fingerprints were not folded to a fixed size. The fingerprints and BRICS fragments were generated as count vectors using the open-source cheminformatics toolkit RDKit (version 2016.09.2).30

Fragment filtering: The final fragments for each molecule form the vocabulary of our corpus. Here we filter out the common and rare fragments: all fragments occurring in more than 10% of the molecules in the dataset are ignored, as are fragments appearing in less than 0.1% (datasets with > 1000 molecules) or 1% (smaller datasets). These parameters can be adjusted according to the dataset or the objective of the experiment. This filtering idea was also adopted from the text-mining field, where it is usually done as a preprocessing step when building a topic model. There, texts are filtered to remove 'stop words' (terms like 'and', 'not', etc.) and very rare words.8 The removal of rare words is done for two reasons: efficiency (each element in the vocabulary adds another column to our input and output matrices) and to prevent rare terms from disproportionately influencing the results.31 All of the important parameters of our chemical topic model are summarized in Table 2. One important difference in chemical topic modeling is that the final molecule-fragment matrix, the input matrix for the LDA algorithm, is much sparser than it usually is for text documents. This leads to new challenges such as molecules without features (due to filtering) or the need to include a larger number of topics to obtain a useful model. We discuss these challenges below in the Results section.

Table 2. Important parameters for the chemical topic model.

Parameter                     Value range          Default value
Number of topics              1-1000               -
Threshold rare fragments      0.0-1.0              0.001
Threshold common fragments    0.0-1.0              0.1
Sample dataset size           0.1-1.0              0.1 for big datasets, 1.0 for small datasets
Fragment type                 Morgan, RDK, BRICS   Morgan
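The document-frequency filter described above could be implemented along these lines; this is our own sketch with hypothetical names, using the default thresholds from Table 2:

```python
import numpy as np

def filter_fragments(X, common_thresh=0.1, rare_thresh=0.001):
    """Drop fragment columns occurring in more than `common_thresh`
    or in fewer than `rare_thresh` of the molecules.

    X: molecule-fragment count matrix (n_molecules x n_fragments).
    Returns the filtered matrix and the indices of the kept columns.
    """
    X = np.asarray(X)
    n_mols = X.shape[0]
    # Fraction of molecules containing each fragment at least once.
    doc_freq = (X > 0).sum(axis=0) / n_mols
    keep = (doc_freq <= common_thresh) & (doc_freq >= rare_thresh)
    return X[:, keep], np.flatnonzero(keep)
```

After filtering, some rows may become all-zero (molecules without features), which is one of the challenges mentioned above.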

Visualization: As described above, the topic modeling algorithm generates a topic-fragment probability matrix (see Figure 1) containing the probability of each fragment being associated with a certain topic. This matrix allows intuitive visualization of the chemical topic model: topics can be directly highlighted within compound structures. The most probable fragments of a topic can help to quickly identify topics that are structurally interesting. This is a distinct advantage of the topic modeling approach over clustering methods like K-means, where interpretation is difficult because the clusters are defined solely by the compounds that compose them and their similarities.


Figure 4 shows an example visualization of a chemical topic model. At the top of Figure 4, a histogram indicates the fraction of molecules associated with each topic. This helps in assessing which topics are more general (a higher proportion of molecules usually falls into those) and which are more specific. Furthermore, the molecules assigned with a high probability to a certain topic can be analyzed, and the topic (the fragments most likely associated with it) can be highlighted within the structures of the compounds. In the example in Figure 4, the dichlorophenyl ring attached to the 2-quinolinone and the N-tert-butyl-substituted piperidine ring constitute the major motifs of topic 21. This could also be inferred by investigating the top three fragments of this topic (see Figure 4, top left).

Figure 4. Topic model visualization. Top: Example visualization of a 60-topic model of dataset A. For each molecule the most probable topic is determined and the molecule is assigned to this topic. The distribution of this assignment is depicted in the histogram indicating the fraction of molecules associated with a certain topic. Top left: Top 3 fragments of topic 21 are shown along with their probability to be associated with this topic. The Morgan FP fragmentation was used to build this model. Bottom: Top four molecules of topic 21 based on their topic probability; within the structures fragments with a high probability for topic 21 are highlighted in light blue (the larger the highlight radius the higher the probability).


The visualization exemplified in Figure 4 is part of our chemical topic model implementation, which can be found on GitHub (www.github.com/rdkit/CheTo).

Model evaluation. In order to assess whether topic modeling is applicable to and useful for chemical data, we designed several experiments. In the first experiment, a proof of concept, we evaluated whether the method is able to extract "reasonable" topics from a set of compounds. For this we used a dataset containing various chemical series developed for 36 different protein targets (dataset A, see above and Table S1). For each of the 36 targets we determined the major topic (the topic assigned to the majority of the compounds of a series) and calculated the recall and precision within this major topic. The recall measures how many of the compounds of a series could be retrieved in the major topic of a target; the precision reflects how many compounds from other series land in the major topic of a certain target. This experiment enables us to quantitatively evaluate how well the inferred topics match the concepts of medicinal chemists. It is similar to the word/topic intrusion experiment proposed by Blei and coworkers20 and is in a way the reverse of it: the chemical series is a human-assigned concept, and we check how well the topic model recapitulates it and whether the model channels "intruder compounds" into a series. Additionally, we use this setup to explore and optimize parameters important for the topic model: the number of topics and the fragment types/filtering (compare Table 2). Furthermore, we evaluated topic stability by testing whether the most probable fragments of the major topics of the different targets stay the same across ten different LDA runs, i.e. changing the random number seed of the LDA in each run. This is an important point if we use the model to interpret and learn from our data.
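The series-recovery metrics just described (major topic, recall, precision) might be computed along these lines; this is our own sketch with hypothetical names, not the authors' code:

```python
from collections import Counter

def major_topic_recall_precision(series_labels, topic_assignments, series):
    """Recall and precision of one chemical series within its major topic.

    series_labels: true series label per molecule (human-assigned concept).
    topic_assignments: most probable topic per molecule (from the model).
    series: the series to evaluate.
    """
    # Topics assigned to members of the series; the major topic is the
    # one covering most of them.
    member_topics = [t for s, t in zip(series_labels, topic_assignments)
                     if s == series]
    major_topic = Counter(member_topics).most_common(1)[0][0]
    # All molecules falling into the major topic, from any series.
    in_major = [s for s, t in zip(series_labels, topic_assignments)
                if t == major_topic]
    recall = member_topics.count(major_topic) / len(member_topics)
    precision = in_major.count(series) / len(in_major)
    return major_topic, recall, precision
```

A precision below 1.0 indicates "intruder compounds" from other series landing in the series' major topic.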
In order to quantify stability, we calculated the mean pairwise Tanimoto similarity of the fragment-probability vectors of each target's major topic across the ten runs. We used the following version of the Tanimoto similarity for continuous values, proposed by Sayle and coworkers:32

T(p, q) = \frac{\sum_i \min(p_i, q_i)}{\sum_i \max(p_i, q_i)}

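A direct plain-Python transcription of this similarity (our own minimal helper, not the authors' code) is:

```python
def tanimoto_continuous(p, q):
    """Tanimoto similarity for non-negative continuous vectors:
    sum of element-wise minima over sum of element-wise maxima."""
    num = sum(min(a, b) for a, b in zip(p, q))
    den = sum(max(a, b) for a, b in zip(p, q))
    return num / den if den else 0.0

# Two fragment probability vectors over the same vocabulary ordering.
p = [0.5, 0.3, 0.2, 0.0]
q = [0.4, 0.3, 0.0, 0.3]
sim = tanimoto_continuous(p, q)  # (0.4 + 0.3) / (0.5 + 0.3 + 0.2 + 0.3)
```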
ACS Paragon Plus Environment

Page 15 of 54


Journal of Chemical Information and Modeling

The final experiment was to measure the runtime efficiency of the method in a large-scale experiment: generating a topic model for the 1.6 million molecules of the whole ChEMBL 22 dataset.

RESULTS

In the following we first investigate the behavior of our novel approach by assessing its ability to identify compound series in a set of molecules. We determine the number of topics necessary to obtain a good separation of the molecules. Furthermore, we investigate the influence of the different fragmentation methods and filters on the results. The stability of the topics – the composition of a topic in terms of the most probable fragments – is analyzed and discussed in the next section. Finally, an application of the method to organizing larger sets of molecules is presented. This is complemented by runtime measurements of the method on the ChEMBL 22 dataset.

Different fragmentation methods and number of topics. To create a chemical topic model, the compounds of the dataset first need to be fragmented. Here we tried three different methods: Morgan and RDKit fingerprints as well as BRICS bond-cutting rules. These methods generate different types of fragments as described above (compare also Figure 3). Furthermore, we analyzed the influence of filtering those fragments to remove common and rare fragments. Table 3 summarizes the results of fragmenting the molecules of dataset A.

Table 3. Results of different fragmentation methods and filtering of dataset A. Filtering thresholds: 0.1 (common fragments), 0.01 (rare fragments).

filtered (thresholds: 0.1 common frag; 0.01 rare frag)

unfiltered Method #fragments

mean # different fragments per molecule

#fragments

mean # different fragments per molecule

# molecules w/o fragments a)

Morgan FP

2839

21.4

554

14.7

3 (0.34%)

RDK FP

5620

170

2276

79

1 (0.11%)

85

2.4

44 (5%)

BRICS 553 5.2 a) In parentheses: percent of the dataset (880 molecules).


From Table 3 it is clear that filtering out common and rare fragments dramatically reduces the vocabulary for our topic model. The major effect comes from filtering out rare fragments, which leads to a reduction of 53–83 % in the number of fragments. A disadvantage of the filtering is that we lose a few molecules which consist only of these rare fragments (compare Table 3, last column). Next, after generating molecule-fragment occurrence matrices, a topic model can be built by choosing the desired number of topics. In our first experiment – the retrieval of the 36 compound series of dataset A – the expected number of topics is almost predetermined. However, it needs to be considered that a medicinal chemistry publication might contain more than one compound series or reference compounds which are not members of the series. Furthermore, some targets in dataset A, such as the carbonic anhydrases, are related to each other; this may mean that some of the targets/publications contain compounds from very similar series. For these reasons, and because we would like to explore the chemical topic model more deeply, we ran the model with different parameter settings: 10 to 100 topics, the three different fragment methods, and ten different seeds for the LDA algorithm. Figure 5 shows the result of this experiment: the mean overall recall and precision of ten different LDA runs.
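The frequency filtering can be sketched as document-frequency thresholding (a plain-Python sketch under our own naming; threshold semantics as in Table 3):

```python
from collections import Counter

def filter_vocabulary(frags_per_mol, rare=0.01, common=0.10):
    """Keep fragments whose document frequency lies between the rare
    and common thresholds; also report molecules left without features.

    frags_per_mol: one set of distinct fragments per molecule.
    """
    n = len(frags_per_mol)
    df = Counter(f for frags in frags_per_mol for f in frags)
    vocab = {f for f, c in df.items() if rare <= c / n <= common}
    empty = [i for i, frags in enumerate(frags_per_mol) if not frags & vocab]
    return vocab, empty

# Toy data: a ubiquitous "core" fragment plus one unique fragment per
# molecule; "amide" occurs in 5 % of the molecules.
mols = [{"core", f"r{i}"} for i in range(100)]
for m in mols[:5]:
    m.add("amide")
vocab, empty = filter_vocabulary(mols, rare=0.02, common=0.50)
# "core" (100 %) is too common, each "r{i}" (1 %) too rare;
# only "amide" survives, leaving 95 molecules without fragments.
```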


Figure 5. Selection of the number of topics for dataset A. Mean recall and precision for ten runs were chosen as criteria. The median recall/precision of each run was derived from the recall/precision of the major topic (= the topic to which the largest fraction of the series' molecules was assigned) for the 36 different compound series. The results were averaged over ten different runs using different seeds for the LDA algorithm. The circles show the mean value; the shaded area highlights the standard deviation. Right: results using an unfiltered molecule-fragment matrix. Left: results using a filtered molecule-fragment matrix (rare fragments = 0.01, common fragments = 0.1).

In general, the best performing topic models in terms of overall recall and precision for the 36 compound series are obtained by choosing between 50 and 60 topics. The FP-based fragmentation methods – Morgan and RDKit FP – perform better than the BRICS fragments. In addition, the performance of the model did not change significantly when using the unfiltered matrices for Morgan or RDKit fragments (compare Figure 5 left). This is different for BRICS fragments, where a measurable increase in recall can be found (see Figure 5 top) when using the unfiltered fragment matrix. This is not surprising given the small number of BRICS fragments remaining after filtering (see Table 3), which implies that our filtering criteria are too strict for the BRICS fragments. In summary, all three fragment methods show promising results in this experiment, achieving a mean recall between 90 and 95 % and a mean precision in the same range when the appropriate number of topics is chosen. Selecting exactly 36 topics to retrieve the 36 compound series would still have resulted in an excellent recall but only in a medium precision of the model (see Figure 5). This is because the LDA method optimizes the consolidation of co-occurrences of words across documents into topics rather than the separation of documents. Due to our evaluation criteria, choosing fewer than 36 topics necessarily leads to a weak performance in precision.

Detailed analysis of a chemical topic model. Next, we investigated the results in more detail: based on the first experiment we selected a topic model with 60 topics and a filtered Morgan FP fragment matrix as input (see Figure 5 right middle). Again, ten LDA runs with different seeds were carried out since the results of the topic model may vary with the initialization of the document-topic and the topic-fragment matrix. Figure 6 shows box plots of the results of this experiment: for each of the 36 compound series the number of topics, the recall, and the precision in the major topic of a compound series were plotted. More detailed results can be found in Table S2 in the Supporting Information.


Figure 6. Topic model of dataset A using 60 topics and the Morgan FP to fragment the molecules. The y-axis shows the 36 targets of dataset A. For each molecule in each of the 36 chemical series the assignment to its most probable topic was used to obtain the number of different topics and to calculate the recall and the precision in the major topic (= the topic to which the largest fraction of the series' molecules was assigned). Fragments were filtered (see text). The box plots show the variation of ten different runs using different seeds for the LDA algorithm. Red line: median. Box: lower and upper quartiles (Q1, Q3). Whiskers: most extreme non-outlier data points (Q1-1.5*(Q3-Q1) and Q3+1.5*(Q3-Q1)). Plusses: outliers.

One third of the compound series obtain an excellent mean recall and precision of over 90 % (see for example 'Sphingosine 1-phosphate receptor Edg-1', 'Beta-2 adrenergic receptor' or 'MAP kinase p38 alpha' in Figure 6). The median mean-recall of 92 % and the median mean-precision of 94 % show that the chemical topic model retrieves most of the chemical series. The worst performance was achieved for the 'HERG' series; this could be expected for this known anti-target. Here, no chemical series is described in the original publication 33 (see the very low pairwise Tanimoto similarity in Table S2); it is simply a collection of HERG-active molecules, so the topic model could not identify a common structural motif for these molecules and assigned them to many different topics (up to 13 different ones). Another set of chemical series showing only a medium performance belongs to the carbonic anhydrases. These series all share a sulfonamide group attached to an aromatic ring – the typical zinc-binder motif of these inhibitors – but nothing more specific. The 'carbonic anhydrase IX' series is especially problematic: most of the molecules in this series are rather small and primarily consist of that motif. Due to the similarity between the carbonic anhydrase series, many share a major topic in the chemical topic model (see Figure 7). This leads to a worse performance in precision and can also result in a smaller recall at the same time (compare Figure 6).

Figure 7. Compound series sharing major topics. Compound series of dataset A which share their major topic within a topic model (= one column: Run 1 – Run 10) are highlighted in the same color. Each column shows the result of an independent run. Compound series without color have their own major topic. Number of topics chosen: 60. Fragmentation method: Morgan FP. Input: filtered molecule-fragment matrix.


Being part of the same topic can also happen for other series which share common substructures, for example the 'Muscarinic acetylcholine receptor 1' series and the 'Cytochrome P450 19A1' inhibitors: most of the latter contain a terminal biphenyl moiety which can also be found in about one third of the compounds of the muscarinic acetylcholine receptor 1 series. In one of our models (see Figure 7, run 5) they share their major topic, which leads to a medium performance in recall and precision for both series. Another example is the 'Melanin-concentrating hormone receptor 1' series and the 'Serotonin 2a (5-HT2a) receptor' series, which share a topic in the last run (see Figure 7). Here, the combination of two less common fragments – a tertiary amine in an aliphatic ring and a trifluoromethylphenyl – which appears in most of the melanin-concentrating hormone receptor 1 compounds but in only one of the serotonin 2a (5-HT2a) receptor molecules ties these series together. The result in this case is a perfect recall for both series but a medium precision due to the shared major topic. In summary, an analysis of the chemical series sharing their major topic allows identification of related series which contain similar substructures. This could be especially interesting if the substructure is involved in or critical for binding to the target. Finally, two of the 36 series – Beta-secretase 1 and Sphingosine 1-phosphate receptor Edg-1 – are completely stable in all ten runs and achieve a perfect mean recall and precision of 1.0. These two series are both very homogeneous and contain a large, specific scaffold, which makes their topics special. Figure 8 shows these two special topics; the top five fragments of each topic are depicted.
While in the beta secretase topic all of these have an equal probability, in the sphingosine topic the first fragment – a terminal phenyl ring attached to a quaternary carbon in an aliphatic ring (Figure 8, top left) – is twice as likely as the other four fragments. Investigating the top six compounds in each of the topics, it becomes clear that these topics are highly specialized in those two chemical series: almost the entire scaffolds of the molecules are described by the topics.


Figure 8. Beta secretase and Sphingosine 1-phosphate receptor Edg-1 topics of the chemical topic model of dataset A. Chemical topic model: 60 topics, filtered Morgan fragment matrix, seed 57. Top: The five most probable fragments of both topics along with their scores/probabilities. The latter vary slightly between the different runs. Bottom: Top six molecules of both topics. All of those have a probability of more than 90 % for their topics. The topic is directly highlighted in turquoise/light orange within the compound structures.

Topic stability. We also investigated the stability of the novel approach in terms of the most probable fragments that compose the topics. This is an important aspect when the chemical topic model is used for interpretation and analysis of data as shown in Figure 8. For this we calculated the Tanimoto similarity of the fragment probability vectors for each topic between different runs. We used the major topic of each series to allow a correct mapping of topics across different runs. When calculating the mean pairwise Tanimoto similarity we only considered the most probable fragments in a topic, i.e., those that account for 80 % of the probability in the topic-fragment probability vector. Table 4 summarizes the results of this experiment when using the Morgan FP fragmentation (results for the other fragment methods can be found in Table S3).

Table 4. Topic stability across 10 different LDA runs. Chemical topic model: Morgan fragmentation of dataset A, filtered; 60 topics. The mean pairwise Tanimoto similarity and the standard deviation for the most probable fragments of each topic are listed.

Target (Morgan FP, filtered)                  | Mean similarity | Std dev.
Carbonic anhydrase II                         | 0.13 | 0.13
HERG                                          | 0.20 | 0.23
Carbonic anhydrase I                          | 0.36 | 0.24
Carbonic anhydrase XII                        | 0.47 | 0.29
Muscarinic acetylcholine receptor M1          | 0.47 | 0.21
Cytochrome P450 19A1                          | 0.51 | 0.26
Cathepsin S                                   | 0.52 | 0.21
Norepinephrine transporter                    | 0.52 | 0.29
Dopamine D3 receptor                          | 0.56 | 0.16
11-beta-hydroxysteroid dehydrogenase 1        | 0.58 | 0.27
Carbonic anhydrase IX                         | 0.58 | 0.20
Serotonin 2a (5-HT2a) receptor                | 0.58 | 0.37
Cannabinoid CB2 receptor                      | 0.60 | 0.23
Tyrosine-protein kinase SRC                   | 0.60 | 0.21
Alpha-2a adrenergic receptor                  | 0.67 | 0.18
Serotonin 2c (5-HT2c) receptor                | 0.67 | 0.19
C-C chemokine receptor type 2                 | 0.71 | 0.14
Adenosine A1 receptor                         | 0.72 | 0.09
Cytochrome P450 2D6                           | 0.72 | 0.16
Vanilloid receptor                            | 0.72 | 0.12
Dipeptidyl peptidase IV                       | 0.73 | 0.16
Melanin-concentrating hormone receptor 1      | 0.73 | 0.16
Cyclooxygenase-2                              | 0.75 | 0.21
Histamine H3 receptor                         | 0.75 | 0.24
Vascular endothelial growth factor receptor 2 | 0.79 | 0.09
Dopamine D2 receptor                          | 0.80 | 0.10
Cytochrome P450 3A4                           | 0.81 | 0.11
MAP kinase p38 alpha                          | 0.81 | 0.16
Serotonin 1a (5-HT1a) receptor                | 0.83 | 0.14
Beta-2 adrenergic receptor                    | 0.84 | 0.06
Cannabinoid CB1 receptor                      | 0.84 | 0.09
Glucocorticoid receptor                       | 0.84 | 0.11
Matrix metalloproteinase-2                    | 0.85 | 0.07
Serotonin transporter                         | 0.85 | 0.12
Beta-secretase 1                              | 0.88 | 0.05
Sphingosine 1-phosphate receptor Edg-1        | 0.91 | 0.05
Median                                        | 0.72 | 0.16

As seen before, the carbonic anhydrase series and the HERG compound series show the lowest mean similarity of the fragment probability vectors across ten different runs. In general, the mean similarity is rather high, with most of the series above 0.6. In order to get a better idea of these values, Figure 9 shows the top five fragments and the most probable molecule of the carbonic anhydrase XII topic across ten different LDA runs.

Figure 9. Top five fragments and most probable molecule of the carbonic anhydrase XII topic across ten different LDA runs. Chemical topic model: Morgan FP fragmentation, filtered fragment matrix, 60 topics.
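The stability measure described above, truncating each topic's fragment probability vector at 80 % of its cumulative mass before averaging pairwise Tanimoto similarities across runs, can be sketched as follows (a plain-Python illustration with our own helper names, not the published code):

```python
def truncate_to_mass(probs, mass=0.8):
    """Zero out all but the most probable entries that together
    account for `mass` of the probability."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [0.0] * len(probs), 0.0
    for i in order:
        if total >= mass:
            break
        kept[i] = probs[i]
        total += probs[i]
    return kept

def tanimoto(p, q):
    num = sum(min(a, b) for a, b in zip(p, q))
    den = sum(max(a, b) for a, b in zip(p, q))
    return num / den if den else 0.0

def topic_stability(runs, mass=0.8):
    """Mean pairwise Tanimoto similarity of the truncated fragment
    probability vectors of the same (major) topic across LDA runs."""
    vecs = [truncate_to_mass(r, mass) for r in runs]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(tanimoto(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)

# Three identical runs are perfectly stable; a permuted run is not.
stable = topic_stability([[0.5, 0.3, 0.2, 0.0]] * 3)  # 1.0
shaky = topic_stability([[0.5, 0.3, 0.2, 0.0], [0.0, 0.3, 0.2, 0.5]])
```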


The carbonic anhydrase XII topic was selected since its mean similarity was below 0.5. Nevertheless, investigating the top five fragments of this topic, and the topic itself as highlighted within the structure of the most probable compound, shows that the chemotype is fairly consistent. Although the order of the top five fragments changes between runs, and in some of the runs (1, 9, 10) other fragments appear in the top five, the major substructures representing the carbonic anhydrase XII topic stay the same: the urea group between the two phenyl rings and the sulfonamide moiety. In runs 1, 9, and 10 the most probable compound does not belong to the carbonic anhydrase XII series but to the cytochrome P450 3A4 series, which also exhibits this motif of a central urea between two phenyl rings. This experiment shows that even if the composition of a topic in terms of the most probable fragments changes across runs, the overall meaning of the topic stays rather constant, so the model can be useful for analyzing chemical datasets.

Chemical topic modeling for related compounds. In order to explore chemical topic modeling in a slightly different setup, we created a second dataset which consists of five subsets of chemical series for five different protein targets (carbonic anhydrase II, dopamine D2 receptor, MAP kinase p38 alpha, dipeptidyl peptidase IV, and cathepsin S). For each of the five targets we collected between 27 and 47 chemical series (see Table 1 for a summary). These series are more closely related to each other than the series in dataset A. Consequently, this setup is expected to be more challenging for the topic model when re-extracting the human-assigned concepts (the different chemical series). As in the experiment before, we ran the LDA algorithm with different numbers of topics to find the best compromise of median recall and precision for the five subsets.
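One plausible way to formalize such a "best compromise" (our own selection rule; the paper does not specify the exact criterion) is to take the topic count with the highest minimum of the two medians:

```python
def pick_n_topics(results):
    """results: {n_topics: (median_recall, median_precision)}.
    Return the topic count with the best balanced performance,
    here the highest minimum of recall and precision."""
    return max(results, key=lambda n: min(results[n]))

# Toy numbers in the spirit of Table 5 (not actual measurements).
scores = {40: (0.88, 0.93), 60: (0.64, 0.65), 80: (0.87, 0.86)}
best = pick_n_topics(scores)  # 40, since min(0.88, 0.93) = 0.88 is largest
```

A harmonic mean of recall and precision (an F1-style score) would behave similarly.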
We chose the Morgan FP fragmentation to build the molecule-fragment input matrix and filtered the fragments using the same thresholds as before (0.1 for common fragments, 0.01 for rare fragments). Table 5 summarizes the results of this experiment; further results can be found in Figures S1 and S2.

Table 5. Final number of topics selected for the subsets in dataset B (best balance between median recall and precision). The performance is shown for two different LDA learning methods: batch and online.


LDA batch learning
Target                  | # papers | # topics selected | Median recall | Median precision
Carbonic anhydrase II   | 38 | 60 | 0.64 | 0.65
Dopamine D2 receptor    | 47 | 80 | 0.87 | 0.86
MAP kinase p38 alpha    | 31 | 40 | 0.88 | 0.93
Dipeptidyl peptidase IV | 43 | 70 | 0.86 | 0.91
Cathepsin S             | 27 | 40 | 0.92 | 0.91

LDA online learning
Target                  | # papers | # topics selected | Median recall | Median precision
Carbonic anhydrase II   | 38 | 80 | 0.73 | 0.72
Dopamine D2 receptor    | 47 | 90 | 0.84 | 0.82
MAP kinase p38 alpha    | 31 | 60 | 1.0  | 1.0
Dipeptidyl peptidase IV | 43 | 70 | 0.86 | 0.91
Cathepsin S             | 27 | 50 | 0.98 | 1.0

A topic model able to almost perfectly retrieve the chemical series could be generated for four of the five subsets. The exception is carbonic anhydrase II, where the best topic model – in terms of best balance between median recall and precision – achieves only moderate performance. This has several reasons: some of the series share a larger scaffold since one is extracted from a follow-up paper of the other (see for example papers with ChEMBL document IDs 20507 and 58463 34, 35), other series are very generic and include no special structural features (e.g. ChEMBL document ID 20718 36), and finally, some papers do not contain a series at all but rather a set of molecules (e.g. ChEMBL document ID 30111 37). The latter situation leads to a weak performance in both recall and precision. In contrast, series that are strongly related (like series from follow-up papers) will probably land in the same topic and thereby cause a low precision. Additionally, we used this dataset to compare the performance of the two training methods provided by the LDA implementation: batch and online learning. With batch learning the whole dataset is provided at once to the variational Bayes optimization of the LDA algorithm, while with online learning the model is optimized incrementally by running the optimization on chunks of the dataset. The online approach is necessary when building models on large datasets that cannot be kept in memory at once. Since we would like to be able to apply the chemical topic model to larger sets of molecules, like the


whole ChEMBL database, it is important to determine if there is a substantial difference in performance between the two learning methods. Table 5 shows the results of both methods: there is no major difference in accuracy. The online method seems to perform slightly better but needs more topics to reach the higher accuracy. A thorough comparison of the two learning methods is out of scope for this publication, but the similar performance of both methods makes the next experiment feasible: building topic models on the 1.6 million molecules of the ChEMBL 22 dataset.

Chemical topic modeling on the whole of ChEMBL 22. Topic modeling has been successfully used in many fields to organize huge collections of documents. To enable this, the document-word matrix needs to be thoroughly filtered as discussed above. Furthermore, online learning is required to handle the huge amount of data and to build a topic model 21. In these final experiments we discuss an efficient strategy we have developed to allow model building on large chemical datasets. For this purpose we used the whole ChEMBL 22 dataset 14, which contains about 1.6 million unique compounds. We started by determining what fraction of the dataset needs to be considered to build a representative final vocabulary for the chemical topic model. Since the molecule-fragment matrix needs to be filtered at the end to remove infrequent fragments, considering every single molecule to build up this matrix might not be necessary. Drawing a smaller random sample of the dataset should result in the same or at least a very similar set of fragments after filtering. Figure 10 shows the result of this using the Morgan fragment method and the ChEMBL 22 dataset.
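The subsampling idea can be imitated on synthetic data (a hedged plain-Python sketch; all names and the toy dataset are ours):

```python
import random
from collections import Counter

def build_vocab(mols, rare=0.001, common=0.10):
    """Document-frequency filtered vocabulary (as described above)."""
    n = len(mols)
    df = Counter(f for m in mols for f in m)
    return {f for f, c in df.items() if rare <= c / n <= common}

def subsample_coverage(mols, ratio, seed=42):
    """Fraction of the full filtered vocabulary recovered when the
    vocabulary is built from a random subsample of the molecules."""
    full = build_vocab(mols)
    sample = random.Random(seed).sample(mols, int(ratio * len(mols)))
    return len(full & build_vocab(sample)) / len(full)

# Synthetic dataset: 1000 molecules over 50 fragments, 20 copies each.
mols = [{f"f{i % 50}"} for i in range(1000)]
cov_all = subsample_coverage(mols, 1.0)  # building from all data: 1.0
cov_10 = subsample_coverage(mols, 0.1)   # a 10 % sample recovers most of it
```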


Figure 10. Determination of subsample size for building the vocabulary of the chemical topic model. Dataset: ChEMBL 22. Fragment method: Morgan. Blue line: number of unique fragments generated depending on the subsample ratio of the dataset. Green line: size of the final vocabulary after filtering the fragments (threshold rare fragments: 0.1 %, threshold common fragments: 10 %). Please note that the left y-axis uses a log scale. Red line: percent of vocabulary fragments overlapping with the vocabulary built from all data. All results represent the mean value of five randomly drawn subsets. The standard deviation is shown as a shaded area (visible at the beginning of the red line).

This experiment clearly illustrates that subsampling 10 % of the dataset generates an almost identical vocabulary: compared to building the vocabulary using all data, we found a coverage of about 95 % of the vocabulary fragments when considering only 10 % of the data. The same result was found for the two other fragment methods (data not shown). This strategy reduces the amount of memory required to store the fragments of each molecule as well as the size of the final molecule-fragment matrix. Building the model using only 10 % of the data may also result in a more general model.

Using this approach we fragmented the ChEMBL 22 dataset with the following parameters: subsampling size: 10 %, threshold rare fragments: 0.1 %, threshold common fragments: 10 %, fragment method: Morgan. With these settings the fragmentation of the whole ChEMBL 22 dataset takes about 30 min on four CPU cores of an Intel Xeon 64-bit 3.6 GHz eight-core processor. The final vocabulary consisted of 2582 Morgan fragments, leading to a molecule-fragment matrix of size (1599759, 2582), or 3.85 GB (we use 8-bit integers for this count matrix).
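As a quick back-of-envelope check of the quoted matrix size:

```python
# Dense molecule-fragment count matrix of ChEMBL 22 (numbers from the
# text), stored as unsigned 8-bit integers: one byte per entry.
n_molecules, n_fragments = 1_599_759, 2_582
size_gib = n_molecules * n_fragments / 2**30
# approximately 3.85 GiB, matching the 3.85 GB quoted for the count matrix
```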

To evaluate the model building runtime we generated topic models with between 100 and 500 topics. The runtime was measured for fitting the model and for transforming the data with the model. To handle the large amount of data we used the online learning method of the LDA implementation and ran it with data chunks of size 5000. Additionally, for these larger models we reduced the maximum number of optimization iterations to 10 (for the smaller models shown before, max_iter was set to 100). Figure 11 shows the result of this experiment.
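This setup maps naturally onto scikit-learn's LatentDirichletAllocation; the sketch below is our own assumption about the implementation (the paper does not name the library), run on a small random stand-in matrix rather than the real ChEMBL data:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(42)
# Stand-in for the filtered molecule-fragment count matrix
# (rows: molecules, columns: vocabulary fragments).
X = csr_matrix(rng.integers(0, 3, size=(500, 40)))

lda = LatentDirichletAllocation(
    n_components=10,           # number of topics
    learning_method="online",  # incremental variational Bayes
    batch_size=100,            # chunk size (5000 in the text)
    max_iter=10,               # reduced iterations for large models
    random_state=42,
)
theta = lda.fit_transform(X)   # molecule-topic probability matrix
# each row of theta is a probability distribution over the 10 topics
```

For data too large to fit in memory at once, `lda.partial_fit` can be called repeatedly on individual chunks instead.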

Figure 11. Runtimes building chemical topic models on the ChEMBL 22 dataset. Blue line: runtime of the model fitting process. Green line: runtime of transforming the data with the model. Please note that the transformation runtime of the 500-topic model is not available due to memory limitations (16 GB RAM).

The runtime for building a chemical topic model appears to increase linearly with the number of topics (see reference 2 for the time complexity of the LDA method). Fitting a 100-topic model on the whole ChEMBL dataset took about an hour on one CPU of the Intel Xeon 64-bit 3.6 GHz eight-core processor. As expected, the runtime for transforming the data is much lower: only 5 minutes. Another important aspect to consider is the memory needed to handle these huge matrices. Here, an alternative would be to train the


model on a smaller subset of the data (20 %) and just apply the model to the remaining data (experiments on this are planned for an upcoming publication).

Finally, we investigated the 100-topic model of the ChEMBL 22 dataset. Some of these topics are very unspecific, as their major themes are rather small and contain common substructures like chlorophenyl or pyridine. Other topics, in contrast, could be labeled, for example, "small proteins", "cyclic peptides", "steroids", "morphines" or "DNA" (see Figure 12).


Figure 12. 100-topic model of the ChEMBL 22 dataset. Top: mean topic profile of the ChEMBL 22 molecules. Easily interpretable topics are marked with blue arrows. The bars are colored by the number of fragments important for the topic (the darker the blue, the more fragments are associated with the topic). Bottom: the top eight fragments of four different topics are shown: cyclic peptides, small proteins, DNA/RNA, and steroids. Chemical topic model: 100 topics, Morgan FP fragmentation, filtered matrix (threshold rare fragments: 0.1 %, threshold common fragments: 10 %), seed LDA: 42, subsampling size: 10 % of the ChEMBL 22 dataset.

In Figure 12 the results of the 100-topic model of ChEMBL 22 are summarized. We looked at some of these topics in detail and found that topics comprising larger molecules like DNA, proteins, or cyclic peptides indeed have more fragments associated with them (darker blue bars in Figure 12, top). Looking at the top eight fragments of, for example, the cyclic peptide or steroid topic, an experienced chemist could most likely derive which molecules will be associated with these topics. So even for this huge and heterogeneous dataset the chemical topic model provides an interesting starting point for further investigation.

DISCUSSION

Here, we have introduced chemical topic modeling as a novel method to organize sets of molecules. The method is adopted from the text-mining field, where it has been applied very successfully in many different areas (see Introduction). In this first publication our goal was to investigate how we can apply topic modeling to chemical structures and whether or not the results are useful. Since the method is an unsupervised approach, a quantitative evaluation of performance is challenging. Our approach was to measure how well the topic model is able to reproduce human-assigned concepts/groupings, in our case chemical series. In most of the datasets evaluated the method achieved very good results in retrieving chemical series from a set of molecules. In some cases the chemical topic model was not able to reproduce the human concepts, but instead proposed an alternative organization of the molecules. This


highlights subtleties that may not have been found using the original grouping of the molecules and might provide an opportunity to change our perspective on the related molecules. As mentioned above, transferring topic modeling from the text-mining field to chemistry is not without challenges. The fragmentation of texts and molecules is fundamentally different: for molecules, many obvious approaches lead to "overlapping" fragments and thereby to a strong correlation of many of the fragments with each other. This is probably also true for some words in texts (e.g. phrases), but to a lesser extent. In the text-mining field, Wallach found that the derived topics are more reasonable when using n-grams of words instead of unigrams 38. Our molecular fragments are not n-grams in the strict sense since they are not necessarily linear, but it can be argued that they share the beneficial property of being overlapping in many cases. This might contribute to the observation that the n-gram-analogous fragments resulting from the Morgan and RDKit fingerprints seem to work better than the BRICS fragments, which are essentially unigrams. Furthermore, the molecule-fragment matrix is relatively sparse, so filtering out rare and common fragments leads to a more stable and efficient building of the topic model (as shown above) but in some cases results in molecules that have no features. Consequently, appropriate thresholds need to be found for each dataset. Another important difference is the granularity of the topics: in the text-mining field the approach is usually used to organize large sets of documents into an overall thematic structure, like texts about politics, art, or economy. In chemistry we are typically more interested in a finer-grained organization of the molecules, like the application we showed here: grouping chemical series. For large datasets like ChEMBL we found some more general topics like "proteins", "DNA" or "steroids".
These general topics invite further investigation since they are sensible and readily understandable by humans. Obtaining a finer-grained organization of such large datasets would require many more topics in the topic model, which would considerably increase the model's complexity and make building it less feasible. Here, an interesting idea could be the creation of a hierarchical topic model, which has also been built and applied in text mining.5 This way, more detailed topics might be found and a kind of ontology for the molecule set derived. In this regard, we plan to apply the topic modeling to the


ChEBI database,39 which is already mapped into an ontology, and see whether the topic model can recapitulate this ontology. Despite these challenges, which need to be investigated in future studies, the new model offers many interesting advantages. In many of the examples above, the chemical topic model allows intuitive visualization of the topics mapped directly onto the molecules. Furthermore, investigating the top fragments of the topics of a model enables quick identification of “interesting” topics for further analysis. This is a distinct advantage of chemical topic modeling over the usual black-box clustering approaches applied to organize molecules in chemistry. Comprehending a model is a huge benefit for researchers, and the ability of the chemical topic model to reproduce human-assigned concepts makes it a great tool to explore sets of molecules. The simple graphical user interface we provide along with this publication (see GitHub [www.github.com/rdkit/CheTo]) is a first showcase of interactively visualizing the chemical topic model. Here, more sophisticated implementations could produce a valuable tool for scientists to learn about the hidden structure in molecule sets and discover alternative relations between molecules. Another advantage of this novel method is that the topic model belongs to the class of mixed-membership models and thereby enables a fuzzy clustering. We have not exploited this in this first study of chemical topic modeling, but this property – the topic probability vector of each molecule – can be used as a new descriptor space in similarity searching or in machine learning. Finally, LDA describes a generative process, which could be used to generate novel molecules: given a chemical topic model of different series of molecules active against a certain target, the model could be used to create novel molecules by combining the fragments and the topics. However, as with texts, this is not straightforward since the model does not know the grammar or the reaction rules necessary to produce feasible output.
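As an illustration of the descriptor idea, the per-molecule topic probability vectors can be compared with any distance suited to probability distributions. The Hellinger distance used here is one common choice, not one prescribed by the paper, and the four-topic vectors are made up:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two probability vectors:
    0 for identical distributions, 1 for disjoint support."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

# hypothetical 4-topic probability vectors for three molecules
mol_a = [0.70, 0.20, 0.05, 0.05]
mol_b = [0.65, 0.25, 0.05, 0.05]  # shares mol_a's dominant topic
mol_c = [0.05, 0.05, 0.20, 0.70]  # dominated by a different topic

assert hellinger(mol_a, mol_b) < hellinger(mol_a, mol_c)
```

Molecules drawn from similar topic mixtures end up close in this space even when they share no identical fragments, which is what makes the fuzzy-clustering view attractive for similarity searching.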
There are several applications in which chemical topic modeling could help to organize and analyze chemical data. One obvious example is the use of the method in organizing HTS results to identify interesting series and scaffolds for hit selection. Especially with the ever-increasing cherry-picking


capabilities, initial screens tend to be smaller, and a method to successfully select interesting compounds for expansion screens is becoming more important. Another idea could be the derivation of ontologies for large molecule sets or for reaction types. Furthermore, we can also assign labels to the topics based on properties of the molecules found in a certain topic; e.g., using solubility, “soluble” and “insoluble” topics can be discovered and analyzed to find substructures which may lead to this property. Most molecular properties are not induced by a single substructure but by the combination or co-occurrence of several. These combinations might be highlighted by the chemical topic model. In conclusion, in this publication we have shown that chemical topic modeling is a promising new approach, with many challenges and opportunities left to study, in the era of big data and pattern recognition.
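A minimal sketch of the topic-labeling idea, assuming each molecule has already been assigned to its most probable topic and carries a property label such as solubility (all data and labels here are hypothetical):

```python
from collections import Counter, defaultdict

def label_topics(topic_assignments, property_labels):
    """Label each topic with the majority property label of the
    molecules whose most probable topic it is."""
    by_topic = defaultdict(list)
    for topic, label in zip(topic_assignments, property_labels):
        by_topic[topic].append(label)
    return {topic: Counter(labels).most_common(1)[0][0]
            for topic, labels in by_topic.items()}

topics = [0, 0, 0, 1, 1, 2]  # most probable topic per molecule
labels = ["soluble", "soluble", "insoluble",
          "insoluble", "insoluble", "soluble"]
print(label_topics(topics, labels))
# {0: 'soluble', 1: 'insoluble', 2: 'soluble'}
```

The top fragments of an “insoluble” topic could then be inspected together, which matches the point above that properties often arise from fragment combinations rather than single substructures.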

ASSOCIATED CONTENT Supporting Information. Additional file 1 containing further plots and details discussed in this study. Additional file 2 containing the Jupyter notebooks to evaluate the new method described in this study. Additional file 3 containing datasets A and B described in this study. This material is available free of charge via the Internet at http://pubs.acs.org. AUTHOR INFORMATION Corresponding Author * E-mail: [email protected].


Author Contributions The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. ORCID Information N. Schneider: orcid.org/0000-0001-5824-2764 N. Fechner: orcid.org/0000-0003-3852-3950 G.A. Landrum: orcid.org/0000-0001-6279-4481 N. Stiefl: orcid.org/0000-0003-2562-7080

Conflict of Interest The authors declare no competing financial interest.
ACKNOWLEDGMENT The authors thank Finton Sirockin and Bernhard Rohde for valuable and critical discussions. The authors thank Anna Pelliccioli and Brian Kelley for critical proofreading of the manuscript. N. Schneider thanks the NIBR Postdoc Program for a Postdoctoral Fellowship.
ABBREVIATIONS FP, fingerprint; LDA, latent Dirichlet allocation.
REFERENCES
1. Hofmann, T. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Mach. Learn. 2001, 42, 177–196.

2. Blei, D. M.; Ng, A. Y.; Jordan, M. I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993-1022.


3. Blei, D. M. Probabilistic Topic Models. Commun. ACM 2012, 55, 77-84.
4. Pritchard, J.; Stephens, M.; Donnelly, P. Inference of Population Structure Using Multilocus Genotype Data. Genetics 2000, 155, 945–959.
5. Teh, Y. W.; Jordan, M. I.; Beal, M. J.; Blei, D. M. Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes. Adv. Neural Inf. Process Syst. 2004, 1385-1392.
6. Bart, E.; Welling, M.; Perona, P. Unsupervised Organization of Image Collections: Taxonomies and Beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2301–2315.
7. Zhao, W.; Zou, W.; Chen, J. J. Topic Modeling for Cluster Analysis of Large Biological and Medical Datasets. BMC Bioinf. 2014, 15, S11.
8. Hoffman, M. D.; Blei, D. M.; Wang, C.; Paisley, J. W. Stochastic Variational Inference. J. Mach. Learn. Res. 2013, 14, 1303-1347.
9. Zhao, W.; Chen, J. J.; Perkins, R.; Liu, Z.; Ge, W.; Ding, Y.; Zou, W. A Heuristic Approach to Determine an Appropriate Number of Topics in Topic Modeling. BMC Bioinf. 2015, 16, S8.
10. Wang, V.; Xi, L.; Enayetallah, A.; Fauman, E.; Ziemek, D. GeneTopics - Interpretation of Gene Sets via Literature-driven Topic Models. BMC Syst. Biol. 2013, 7, S10.
11. Liu, L.; Tang, L.; Dong, W.; Yao, S.; Zhou, W. An Overview of Topic Modeling and its Current Applications in Bioinformatics. SpringerPlus 2016, 5, 1608.
12. Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: A Free Tool to Discover Chemistry for Biology. J. Chem. Inf. Model. 2012, 52, 1757−1768.
13. Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B. A.; et al. PubChem Substance and Compound Databases. Nucleic Acids Res. 2016, 44, D1202-D1213.


14. Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L. J.; Chambers, J.; Davies, M.; Krüger, F. A.; Light, Y.; Mak, L.; McGlinchey, S.; et al. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Res. 2014, 42, D1083−D1090.
15. Clark, M. A.; Acharya, R. A.; Arico-Muendel, C. C.; Belyanskaya, S. L.; Benjamin, D. R.; Carlson, N. R.; Centrella, P. A.; Chiu, C. H.; Creaser, S. P.; Cuozzo, J. W.; et al. Design, Synthesis and Selection of DNA-encoded Small-molecule Libraries. Nat. Chem. Biol. 2009, 5, 647–654.
16. Hull, R. D.; Singh, S. B.; Nachbar, R. B.; Sheridan, R. P.; Kearsley, S. K.; Fluder, E. M. Latent Semantic Structure Indexing (LaSSI) for Defining Chemical Similarity. J. Med. Chem. 2001, 44, 1177-1184.
17. Deerwester, S.; Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; Harshman, R. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391.
18. Hull, R. D.; Fluder, E. M.; Singh, S. B.; Nachbar, R. B.; Kearsley, S. K.; Sheridan, R. P. Chemical Similarity Searches Using Latent Semantic Structural Indexing (LaSSI) and Comparison to TOPOSIM. J. Med. Chem. 2001, 44, 1185-1191.
19. Singh, S. B.; Sheridan, R. P.; Fluder, E. M.; Hull, R. D. Mining the Chemical Quarry with Joint Chemical Probes: An Application of Latent Semantic Structure Indexing (LaSSI) and TOPOSIM (Dice) to Chemical Database Mining. J. Med. Chem. 2001, 44, 1564-1575.
20. Chang, J.; Boyd-Graber, J. L.; Gerrish, S.; Wang, C.; Blei, D. M. Reading Tea Leaves: How Humans Interpret Topic Models. Adv. Neural Inf. Process Syst. 2009, 31, 1-9.
21. Hoffman, M.; Bach, F. R.; Blei, D. M. Online Learning for Latent Dirichlet Allocation. Adv. Neural Inf. Process Syst. 2010, 856-864.


22. Wallach, H. M.; Murray, I.; Salakhutdinov, R.; Mimno, D. Evaluation Methods for Topic Models. In Proceedings of the 26th Annual International Conference on Machine Learning, ACM 2009, 1105-1112.
23. Agrawal, A.; Fu, W.; Menzies, T. What is Wrong with Topic Modeling? (and How to Fix it Using Search-based SE). arXiv:1608.08176 2016, available from http://arxiv.org/abs/1608.08176 [accessed March 10, 2017].
24. Riniker, S.; Fechner, N.; Landrum, G. A. Heterogeneous Classifier Fusion for Ligand-based Virtual Screening: or, How Decision Making by Committee can be a Good Thing. J. Chem. Inf. Model. 2013, 53, 2829-2836.
25. Riniker, S.; Landrum, G. A. Open-source Platform to Benchmark Fingerprints for Ligand-based Virtual Screening. J. Cheminf. 2013, 5, 26.
26. Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: a Large-scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40, D1100−D1107.
27. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825−2830.
28. Degen, J.; Wegscheid‐Gerlach, C.; Zaliani, A.; Rarey, M. On the Art of Compiling and Using 'Drug‐Like' Chemical Fragment Spaces. ChemMedChem 2008, 3, 1503-1507.
29. Rogers, D.; Hahn, M. Extended-connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.


30. Landrum, G. A. RDKit: Open-Source Cheminformatics Software [Online], version 2016.03. DOI: 10.5281/zenodo.58441. http://www.rdkit.org, and https://github.com/rdkit/rdkit [accessed July 17, 2016].
31. Petterson, J.; Buntine, W.; Narayanamurthy, S. M.; Caetano, T. S.; Smola, A. J. Word Features for Latent Dirichlet Allocation. Adv. Neural Inf. Process Syst. 2010, 1921-1929.
32. Grant, J. A.; Haigh, J. A.; Pickup, B. T.; Nicholls, A.; Sayle, R. A. Lingos, Finite State Machines, and Fast Similarity Searching. J. Chem. Inf. Model. 2006, 46, 1912-1918.
33. Zachariae, U.; Giordanetto, F.; Leach, A. G. Side Chain Flexibilities in the Human Ether-a-go-go Related Gene Potassium Channel (hERG) Together with Matched-pair Binding Studies Suggest a New Binding Mode for Channel Blockers. J. Med. Chem. 2009, 52, 4266-4276.
34. Garaj, V.; Puccetti, L.; Fasolis, G.; Winum, J. Y.; Montero, J. L.; Scozzafava, A.; Vullo, D.; Innocenti, A.; Supuran, C. T. Carbonic Anhydrase Inhibitors: Novel Sulfonamides Incorporating 1,3,5-triazine Moieties as Inhibitors of the Cytosolic and Tumor-associated Carbonic Anhydrase Isozymes I, II and IX. Bioorg. Med. Chem. Lett. 2005, 15, 3102-3108.
35. Carta, F.; Garaj, V.; Maresca, A.; Wagner, J.; Avvaru, B. S.; Robbins, A. H.; Scozzafava, A.; McKenna, R.; Supuran, C. T. Sulfonamides Incorporating 1,3,5-triazine Moieties Selectively and Potently Inhibit Carbonic Anhydrase Transmembrane Isoforms IX, XII and XIV over Cytosolic Isoforms I and II: Solution and X-ray Crystallographic Studies. Bioorg. Med. Chem. Lett. 2011, 19, 3105-3119.
36. Winum, J. Y.; Pastorekova, S.; Jakubickova, L.; Montero, J. L.; Scozzafava, A.; Pastorek, J.; Vullo, D.; Innocenti, A.; Supuran, C. T. Carbonic Anhydrase Inhibitors: Synthesis and Inhibition of Cytosolic/Tumor-associated Carbonic Anhydrase Isozymes I, II, and IX with Bis-sulfamates. Bioorg. Med. Chem. Lett. 2005, 15, 579-584.


37. Özensoy, Ö.; Puccetti, L.; Fasolis, G.; Arslan, O.; Scozzafava, A.; Supuran, C. T. Carbonic Anhydrase Inhibitors: Inhibition of the Tumor-associated Isozymes IX and XII with a Library of Aromatic and Heteroaromatic Sulfonamides. Bioorg. Med. Chem. Lett. 2005, 15, 4862-4866.
38. Wallach, H. M. Topic Modeling: Beyond Bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, ACM 2006, 977-984.
39. Hastings, J.; de Matos, P.; Dekker, A.; Ennis, M.; Harsha, B.; Kale, N.; Muthukrishnan, V.; Owen, G.; Turner, S.; Williams, M.; et al. The ChEBI Reference Database and Ontology for Biologically Relevant Chemistry: Enhancements for 2013. Nucleic Acids Res. 2013, 41, D456-D463.


Table of Contents graphic


Figure 1. Chemical topic modeling workflow. The terms used in topic modeling of text documents are shown in gray to make the connection to chemical topic modeling.


Figure 2. Fingerprint-based fragments. Left: Generation of Morgan FP fragments with radius two. Right: Generation of RDKit FP fragments with a path length between three and five bonds. Linear and branched paths are possible. Gray circles: atoms in aliphatic rings. Yellow circles: atoms in aromatic rings. Gray lines: bonds to atoms which are not part of the fragment. In the Morgan FP those atoms are implicitly included in the invariant; in the RDKit FP they are ignored. Dotted lines: aromatic bonds.


Figure 3. Exemplary fragments derived from random molecules using the three different fragmentation approaches. Gray circles: aliphatic ring atoms. Yellow circles: aromatic ring atoms. Gray lines: neighboring atoms not directly considered for the fragment. Dotted lines: aromatic bonds.


Figure 4. Topic model visualization. Top: Example visualization of a 60-topic model of dataset A. For each molecule the most probable topic is determined and the molecule is assigned to this topic. The distribution of this assignment is depicted in the histogram, indicating the fraction of molecules associated with a certain topic. Top left: The top three fragments of topic 21 are shown along with their probability of being associated with this topic. The Morgan FP fragmentation was used to build this model. Bottom: Top four molecules of topic 21 based on their topic probability; within the structures, fragments with a high probability for topic 21 are highlighted in light blue (the larger the highlight radius, the higher the probability).


Figure 5. Selection of the number of topics for dataset A. Mean recall and precision over ten runs were chosen as criteria. The median recall/precision of each run was derived from the recall/precision of the major topic (= the topic to which the largest fraction of the series' molecules was assigned) for the 36 different compound series. The results were averaged over ten different runs using different seeds for the LDA algorithm. The circles show the mean value; the shaded area highlights the standard deviation. Right: results using an unfiltered molecule-fragment matrix. Left: results using a filtered molecule-fragment matrix (rare fragments = 0.01, common fragments = 0.1).


Figure 6. Topic model of dataset A using 60 topics and the Morgan FP to fragment the molecules. The y-axis shows the 36 targets of dataset A. For each molecule in each of the 36 chemical series, the assignment to its most probable topic was used to obtain the number of different topics and to calculate the recall and the precision in the major topic (= the topic to which the largest fraction of the series' molecules was assigned). Fragments were filtered (see text). The box plots show the variation of ten different runs using different seeds for the LDA algorithm. Red line: median. Box: lower and upper quartiles (Q1, Q3). Whiskers: most extreme non-outlier data points (Q1-1.5*(Q3-Q1) and Q3+1.5*(Q3-Q1)). Plusses: outliers.
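The recall/precision calculation described in the caption can be sketched as follows; the topic assignments are hypothetical, and the function simply mirrors the definitions given here (major topic = the topic to which the largest fraction of the series was assigned):

```python
from collections import Counter

def major_topic_stats(series_topics, dataset_topics):
    """Recall and precision of a compound series' major topic.

    series_topics:  most probable topic of each molecule in the series
    dataset_topics: most probable topic of every molecule in the dataset
    """
    major, n_hits = Counter(series_topics).most_common(1)[0]
    recall = n_hits / len(series_topics)              # series coverage
    precision = n_hits / dataset_topics.count(major)  # topic purity
    return major, recall, precision

series = [7, 7, 7, 7, 3]            # 5-molecule series
dataset = series + [7, 2, 2, 5, 5]  # 5 further molecules in the dataset
assert major_topic_stats(series, dataset) == (7, 0.8, 0.8)
```

A recall of 1.0 with low precision means the series is swallowed by a larger mixed topic; high precision with low recall means the series is split across several topics.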


Figure 7. Compound series sharing major topics. Compound series of dataset A which share their major topic within a topic model (= one column: Run 1 – Run 10) are highlighted in the same color. Each column shows the result of an independent run. Compound series colored white have their own major topic. Number of topics chosen: 60. Fragmentation method: Morgan FP. Input: filtered molecule-fragment matrix.


Figure 8. Beta secretase and sphingosine 1-phosphate receptor Edg-1 topics of the chemical topic model of dataset A. Chemical topic model: 60 topics, filtered Morgan fragment matrix, seed 57. Top: The five most probable fragments of both topics along with their scores/probabilities. The latter vary slightly between the different runs. Bottom: Top six molecules of both topics. All of these have a probability of more than 90 % for their topics. The topic is highlighted directly in turquoise/light orange within the compound structures.


Figure 9. Top five fragments and most probable molecule of the carbonic anhydrase XII topic across ten different LDA runs. Chemical topic model: Morgan FP fragmentation, filtered fragment matrix, 60 topics.


Figure 10. Determination of the subsample size for building the vocabulary of the chemical topic model. Dataset: ChEMBL 22. Fragmentation method: Morgan. Blue line: number of unique fragments generated depending on the subsample ratio of the dataset. Green line: size of the final vocabulary after filtering the fragments (threshold rare fragments: 0.1 %, threshold common fragments: 10 %). Please note that the left y-axis uses a log scale. Red line: percentage of overlapping vocabulary fragments used to build the vocabulary. All results represent the mean value of five randomly drawn subsets. The standard deviation is shown as a shaded area (see beginning of the red line).


Figure 11. Runtimes for building chemical topic models on the ChEMBL 22 dataset. Blue line: runtime of the model fitting process. Green line: runtime of transforming the data with the model. Please note that the transform runtime of the 500-topic model is not available due to memory limitations (16 GB RAM).


Figure 12. 100-topic model of the ChEMBL 22 dataset. Top: mean topic profile of the ChEMBL 22 molecules. Easily interpretable topics are marked with blue arrows. The bars are colored by the number of fragments important for the topic (the darker the blue, the more fragments are associated with the topic). Bottom: the top eight fragments of four different topics are shown: cyclic peptides, small proteins, DNA/RNA, and steroids. Chemical topic model: 100 topics, Morgan FP fragmentation, filtered matrix (threshold rare fragments: 0.1 %, threshold common fragments: 10 %), LDA seed: 42, sub-sampling size: 10 % of the ChEMBL 22 dataset.
