
Jan 14, 2016
Article pubs.acs.org/jcim

Public Domain HTS Fingerprints: Design and Evaluation of Compound Bioactivity Profiles from PubChem’s Bioassay Repository

Kazi Yasin Helal,†,§,∥ Mateusz Maciejewski,#,∥ Elisabet Gregori-Puigjané,† Meir Glick,‡ and Anne Mai Wassermann*,#

† Novartis Institutes for Biomedical Research Inc., 250 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
# Pfizer Inc., 610 Main Street, Cambridge, Massachusetts 02139, United States
‡ Merck Research Laboratories, Boston, Massachusetts 02115, United States

Supporting Information

ABSTRACT: Molecular profiling efforts aim at characterizing the biological actions of small molecules by screening them in hundreds of different biochemical and/or cell-based assays. Together, these assays yield a rich data landscape of target-based and phenotypic effects of the tested compounds. However, submitting an entire compound library to a molecular profiling panel can easily become cost-prohibitive. Here, we make use of historical screening assays to create comprehensive bioactivity profiles for more than 300 000 small molecules. These bioactivity profiles, termed PubChem high-throughput screening fingerprints (PubChem HTSFPs), report small molecule activities in 243 different PubChem bioassays. Although the assays originate from independently pursued drug or probe discovery projects, we demonstrate their value as molecular signatures when used in combination. We use these PubChem HTSFPs as molecular descriptors in hit expansion experiments for 33 different targets and phenotypes, showing that, on average, they lead to 27 times as many hits in a set of 1000 chosen molecules as a random screening subset of the same size (average ROC score: 0.82). Moreover, we demonstrate that PubChem HTSFPs retrieve hits that are structurally diverse and distinct from active compounds retrieved by chemical similarity-based hit expansion methods. PubChem HTSFPs are made freely available for the chemical biology research community.



INTRODUCTION

Advances in high-throughput screening technologies have made it possible to characterize entire small molecule libraries in biological profiling experiments.1,2 For compound libraries, gene expression, proteomics, or imaging assays can be run to systematically capture changes in RNA levels, protein products, or cellular morphology following compound treatment.3−7 For each compound, the observed biological perturbations serve as a molecular signature. Under the assumption that compounds with a similar mechanism-of-action generate similar signatures, biological activities from profiling experiments have been used successfully as molecular descriptors for mechanism-of-action elucidation, target identification, or hit expansion.8 However, despite the continuous drop in cost for “omics” experiments, expenses are still prohibitive for a systematic profiling of screening collections containing hundreds of thousands or even millions of compounds.

A cost-saving alternative to the multiparametric profiling experiments described above is offered by so-termed HTS fingerprints (HTSFPs), first introduced by Novartis in 2012.9 About 200 biochemical and cell-based HTS assays that had been run at the company were combined into a biological fingerprint, and small molecules were compared based on their activities across this historical assay panel. No additional screening cost was incurred as only historical data from the corporate databases were used. The ability of these HTSFPs to capture the mechanism-of-action of small molecules was demonstrated in multiple virtual screening and target prediction studies,9−11 and HTSFPs are routinely used for hit expansion in early screening stages of pipeline projects. However, Novartis HTSFPs are built from proprietary data and as such are not accessible to the public.

Received: August 10, 2015


DOI: 10.1021/acs.jcim.5b00498 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX


Journal of Chemical Information and Modeling

Figure 1. HTSFP generation workflow. For simplicity, the removal of counter-assays and other assays without typical primary activity values is omitted from the workflow.

Herein, we aggregate assays deposited in the PubChem Bioassay repository12 to build a publicly available HTS fingerprint. Other efforts to leverage public assay data to compare small molecules have been published previously. For example, Cheng et al. used NCI60 cancer cell line proliferation assays taken from PubChem’s Bioassay repository to build bioactivity profiles for 4296 small molecules and predict compound−target associations using a nearest neighbor approach.13 Riniker et al. built HTSFPs from 95 PubChem assays deposited by the NCATS Chemical Genomics Center, the Scripps Research Institute Molecular Screening Center, and the Burnham Center for Chemical Genomics that had screened more than 338 000 compounds.14 Whereas the focus of their study was the comparison of different machine-learning algorithms on these HTSFPs and assays were simply chosen by the number of compounds that they had screened, we concentrate on the generation, performance evaluation, and publication of a balanced PubChem HTSFP. While the concept of HTSFPs is not novel, it is important to understand that the performance of this descriptor type is solely dependent on the quality and biological relevance of the assays that are used to build the fingerprint. Hence, it cannot be taken for granted that HTSFPs built from a different assay panel meet the performance reported by Petrone and co-workers at Novartis.9 Also, while all the data for the Novartis HTSFP was generated in-house, i.e., at the same screening facility, we here combine assays from different academic high-throughput screening centers. Care is taken to assemble a bioactivity profile that covers both biochemical and cell-based assays, spans a large variety of target classes, and is mostly complete for the compound set under study (i.e., most compounds have been tested in most assays). 
This study operates on an unprecedented scale by building HTS fingerprints consisting of 243 bioassays for more than 300 000 compounds, and a step-by-step procedure for fingerprint generation is described in detail. In benchmark trials, we demonstrate that the newly built PubChem HTS fingerprints show superior performance to a state-of-the-art chemical structure-based hit expansion method across a variety of target classes and yield structurally diverse hits that are beyond the prediction capability of 2D chemical descriptors, corroborating observations previously reported for HTSFPs using proprietary screening data.9



MATERIALS AND METHODS

PubChem HTS Fingerprint Generation. We determined all assays from the PubChem BioAssay database that met the following search criteria: “TotalSidCount from 10,000”, “Chemical”, “Primary Screening”, and “NIH Molecular Libraries Program”. For these 579 assays, we downloaded CSV and XML files with assay activity summaries and metadata. We then identified all compounds that had been tested in the 579 assays (we used PubChem compound identifiers (CIDs) to distinguish between molecules) and determined the number of assays in which they had been measured. Only CIDs that had been tested in at least 250 of the 579 assays were kept. Then, we calculated the number of compounds per assay and excluded all assays with fewer than 288 000 compounds. The remaining assay set was further filtered by discarding most assays that were labeled as counter-screens and assays for which no typical primary activities were reported.

For all compound identifiers, SMILES15 strings were downloaded from the PubChem database. Then the cheminformatics software Pipeline Pilot16 (version 8.5) was used to generate molecules from SMILES and standardize their stereo configuration and charges after salt removal. For the standardized molecules, IUPAC InChI Keys17 were generated (one molecule failed this step).

In a final step, to have a consistent activity representation across all assays in the fingerprint, we converted primary activity values for all assays into Z-scores. For each assay, we determined the average activity of all tested compounds and the standard deviation of activity values. Z-scores were obtained by subtracting the average activity from a compound’s activity value and dividing the activity difference by the standard deviation. If two or more compound identifiers mapped to the same InChI Key and had different activities, their calculated Z-scores were averaged to yield a final Z-score for the fingerprint.
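The Z-score conversion and the averaging of duplicate structures can be illustrated with a minimal Python sketch. It assumes that the primary activities of one assay are held in a dictionary keyed by CID. The function names, the toy values, and the use of the population standard deviation are assumptions made for illustration, since the text does not state which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def assay_z_scores(activities):
    """Convert the raw primary activities of one assay into Z-scores.

    `activities` maps a compound identifier (e.g., a CID) to its activity.
    """
    values = list(activities.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all activities are identical; Z-scores are undefined")
    return {cid: (value - mu) / sigma for cid, value in activities.items()}


def merge_by_inchikey(z_by_cid, inchikey_of):
    """Average the Z-scores of all CIDs that map to the same InChI Key."""
    grouped = defaultdict(list)
    for cid, z in z_by_cid.items():
        grouped[inchikey_of[cid]].append(z)
    return {key: mean(zs) for key, zs in grouped.items()}


# Toy data (hypothetical): CIDs 1 and 2 standardize to the same molecule.
raw = {1: 10.0, 2: 20.0, 3: 30.0}
keys = {1: "KEY-A", 2: "KEY-A", 3: "KEY-B"}
z = assay_z_scores(raw)
profile = merge_by_inchikey(z, keys)
```

In this toy case the two CIDs sharing `KEY-A` contribute a single averaged Z-score, so each fingerprint position holds one value per unique molecule.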
The downloaded XML files were parsed to associate assays with Entrez gene identifiers. The bioassay research database (BARD)18 and manual classification were used to determine assay formats and label assays as “biochemical” or “cell-based”.
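The two-step selection of compounds and assays described in this section could be sketched as follows. The function names and the toy data are hypothetical, and the small toy thresholds stand in for the 250 assays and 288 000 compounds used in the text. The sketch counts only retained compounds in the second step, and the helper `completeness` simply reports the fraction of filled cells in a compound-versus-assay matrix.

```python
from collections import Counter


def select_matrix(tested, min_assays, min_compounds):
    """Two-step selection of compounds and assays.

    `tested` maps an assay identifier to the set of compound identifiers
    screened in that assay.
    """
    # Step 1: keep compounds that were tested in at least `min_assays` assays.
    assays_per_compound = Counter(
        cid for cids in tested.values() for cid in cids
    )
    kept_compounds = {
        cid for cid, n in assays_per_compound.items() if n >= min_assays
    }
    # Step 2: keep assays that tested at least `min_compounds` of them.
    kept_assays = {
        aid for aid, cids in tested.items()
        if len(cids & kept_compounds) >= min_compounds
    }
    return kept_compounds, kept_assays


def completeness(n_values, n_compounds, n_assays):
    """Fraction of filled cells in a compound-versus-assay matrix."""
    return n_values / (n_compounds * n_assays)


# Toy data (hypothetical assay and compound identifiers).
tested = {
    "A1": {1, 2, 3, 4},
    "A2": {1, 2, 3},
    "A3": {1, 2, 5},
    "A4": {5},
}
compounds, assays = select_matrix(tested, min_assays=3, min_compounds=2)
```

With the counts reported in the Results (77 348 145 Z-scores for 328 893 molecules and 243 assays), `completeness` returns about 0.968.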


Figure 2. HTSFP composition. (a) Distribution of the number of assays in which each compound was tested. (b) Distribution of the number of compounds tested in the HTSFP assays.

The fingerprint generation process is schematically depicted in Figure 1.

Data Set Reduction. For each compound, the ratio of the number of assays in which it was a hit to the number of assays in which it was tested (i.e., its assay hit rate) was determined. For this purpose, a hit was defined as a compound with an absolute Z-score ≥ 3. All compounds with an assay hit rate > 0.05 were excluded from the hit expansion experiments reported in the main text of this publication. This was done to reduce the influence of potential frequent hitters on the performance of chemical and biological descriptors. It is the aim of this study to determine whether PubChem HTSFPs can detect specific bioactivity patterns that are predictive for the activity of small molecules in a given assay. The identification of promiscuous compounds that generally have a higher propensity than other small molecules to be bioactive could act as a confounding factor that we wanted to exclude upfront. For completeness, results that were obtained on the full data set including promiscuous compounds are reported in the Supporting Information. Key findings were the same for both the reduced and the complete data sets.

Hit Expansion Experiments. To test the ability of HTSFPs to expand around known bioactive hits, we simulated a typical scenario for a drug discovery project: instead of testing the whole compound deck, a pilot screen is run that assays 20% of the collection. Active and inactive chemical matter from this pilot screen is used to build a model that distinguishes between active and inactive compounds. Those compounds that have the highest likelihood of being active are then picked for the next round of testing. In our benchmark trials, we ran hit expansion calculations for 33 separate assays. For each assay we followed the reported PubChem activity outcome to classify compounds as active or inactive.
Twenty percent of the active and 20% of the inactive compounds were then chosen to constitute the pilot screening set. Their HTSFPs or structural 2D fingerprints (extended connectivity fingerprints (ECFP4s),19 as implemented in Pipeline Pilot) were used as descriptors to build naïve Bayes models (component “Learn Good Molecules” in Pipeline Pilot). All 33 assays were part of the HTSFP; however, in each hit expansion experiment the assay under study was excluded from the HTSFP before model building and was therefore not used as a feature. The ECFP4 and HTSFP models were then applied to the test set containing the 80% of the active and inactive compounds not used for model building (i.e., in our simulated scenario those compounds that have not been tested yet but are part of the screening deck and could be screened in the next round of experimental testing).

Receiver operating characteristic (ROC) curves were determined to assess the quality of the HTSFP and ECFP4 models. ROC curves plot the true positive rate against the false positive rate at different Bayes score thresholds (see Supplementary Figure S1). The larger the area under the curve (AUC), the better the performance of the classifier. Herein, we use ROC AUC as a performance metric for HTSFP and ECFP4 models and refer to it as ROC score. Furthermore, enrichment factors (EFs) were calculated for selected compound sets as follows:

$$\mathrm{EF} = \frac{\text{hit rate}_{\text{selected set}}}{\text{hit rate}_{\text{test set}}}$$

with

$$\text{hit rate}_{\text{selected set}} = \frac{\text{no. found active molecules}}{\text{no. molecules in selected set}}$$

and

$$\text{hit rate}_{\text{test set}} = \frac{\text{no. all active molecules in test set}}{\text{no. all molecules in test set}}$$
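Both performance metrics can be computed from a ranked test set, as the following minimal Python sketch illustrates. It assumes that each test compound carries a model score (higher means more likely active) and a binary activity label. The function names and toy values are hypothetical, and the ROC score is obtained through the rank-sum identity rather than by integrating an explicit curve, which yields the same area.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    `labels` holds 1 for active and 0 for inactive compounds. Tied scores
    receive their average rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one active and one inactive compound")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def enrichment_factor(scores, labels, top_n):
    """Hit rate among the `top_n` best-scored compounds over the test set hit rate."""
    if top_n <= 0 or sum(labels) == 0:
        raise ValueError("need top_n > 0 and at least one active compound")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    selected = ranked[:top_n]
    hit_rate_selected = sum(y for _, y in selected) / len(selected)
    hit_rate_test = sum(labels) / len(labels)
    return hit_rate_selected / hit_rate_test


# Toy test set (hypothetical): 3 actives among 10 compounds.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
auc = roc_auc(scores, labels)
ef_top2 = enrichment_factor(scores, labels, top_n=2)
```

In the toy example, the two top-ranked compounds are both active while the background hit rate is 0.3, so the selected pair is enriched by a factor of 1.0/0.3.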

For each assay, five trials were carried out, i.e., the tested compound set was randomly divided into five subsets of equal size and each subset was used once as pilot screening set. To assess the chemical diversity of hits, we calculated Bemis and Murcko scaffolds20 for all active compounds. Scaffolds that were not found among the molecules tested in the pilot screen and used for model building were considered novel chemotypes. For selected compound sets, the percentage of novel chemotypes that were discovered was calculated and compared for biological and chemical structure based hit expansion methods.
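The five-trial setup and the novel-chemotype bookkeeping can be sketched in Python as follows. This is an illustration under stated assumptions: scaffolds are taken as precomputed strings (for example, produced beforehand by a cheminformatics toolkit), the function names and toy identifiers are hypothetical, and the caller decides which training molecules define the set of known scaffolds.

```python
import random


def stratified_folds(actives, inactives, n_folds=5, seed=0):
    """Split actives and inactives separately into `n_folds` subsets.

    Each subset can then serve once as the pilot screening set, with the
    remaining subsets forming the test set.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for group in (list(actives), list(inactives)):
        rng.shuffle(group)
        for position, compound in enumerate(group):
            folds[position % n_folds].append(compound)
    return folds


def scaffold_recall(selected_actives, test_actives, training_scaffolds, scaffold_of):
    """Fraction of novel active scaffolds in the test set that were retrieved.

    A scaffold is novel if it does not occur in `training_scaffolds`.
    """
    novel = {scaffold_of[c] for c in test_actives} - set(training_scaffolds)
    if not novel:
        return 0.0
    found = {scaffold_of[c] for c in selected_actives} & novel
    return len(found) / len(novel)


# Toy data (hypothetical): compounds 0-9 are active, 10-59 inactive.
folds = stratified_folds(range(10), range(10, 60), n_folds=5, seed=0)

scaffold_of = {"a": "S1", "b": "S2", "c": "S3", "d": "S3", "e": "S4"}
recall = scaffold_recall(
    selected_actives=["a", "c"],
    test_actives=["a", "b", "c", "d", "e"],
    training_scaffolds={"S1"},
    scaffold_of=scaffold_of,
)
```

Here three scaffolds (S2, S3, S4) are novel relative to the training set and one of them is retrieved, so the recall is one third.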



RESULTS AND DISCUSSION

HTSFP Generation and Characterization. We found 579 PubChem bioassays that met our initial search criteria. Overall, 715 314 different compounds (CIDs) were reported for these assays. We used a simple two-step approach to generate a mostly complete compound-versus-assay matrix. First, we determined 329 019 compounds that had been tested in at least 250 assays and were kept for our analysis. Second, we identified 284 assays in which at least 288 000 of these retained compounds had been tested. All other assays that had tested fewer compounds were discarded. We then removed 19 counter-screens and 22 assays that did not report primary, single concentration activity values so that, in the end, our final fingerprint assay panel consisted of 243 different assays (see Supplementary Data Set S1 for PubChem assay identifiers (AIDs) and descriptions). Molecule standardization identified redundancies for 244 CIDs, i.e., the InChI Key that they mapped to was also found for another CID. After CID merging, 328 893 unique molecules remained for which PubChem HTSFPs were calculated.

Next, Z-scores were calculated for all compounds. Some assays reported a PubChem activity outcome (“active” or “inactive”) for all compounds but provided the quantitative activity values needed for Z-score conversion for only part of the compound set. This led to a substantial loss of compounds for one assay (AID 651550, for which only 45 030 compounds remained). Overall, 77 348 145 activities (Z-scores) are reported by the activity profiles. Hence, the compound-versus-assay matrix is 96.8% complete. Only 1163 (0.35%) of all molecules have been tested in fewer than 100 assays and 6481 (1.97%) have been tested in fewer than 200 assays (Figure 2a). Similarly, for the large majority of assays (91.8%), a Z-score was reported for more than 300 000 compounds (Figure 2b).

We then further characterized the 243 assays in our panel by determining their assay formats and, if identifiable, the targets that they were measuring with their readouts. Here, 111 assays were biochemical, and 132 were cell-based. Overall, modulations of 198 different proteins were monitored by the assay readouts (Supplementary Data Set S1). Many of these proteins were enzymes (in particular proteases and kinases) and G protein-coupled receptors (GPCRs). Perhaps unsurprisingly, most assays for GPCRs, which are membrane receptors, were cell-based whereas most enzyme assays were biochemical. It should be noted that, in some instances, more than one assay was available for the same target. These assays were mostly searching for different active agents, e.g., antagonists, agonists, or positive allosteric modulators at the same receptors.

Exclusion of Promiscuous Compounds. The large majority of the compounds showed strong biological signals in only a few assays. For example, 71.5% of the compounds in the data set hit in two or fewer assays. Assay hit rates for the whole data set are shown in Supplementary Figure S2. 71.2% of all compounds hit in at most 1% of the assays in which they had been tested. Only 14 488 compounds (4.4%) were active in more than 5% of the assays in which they were tested. These compounds were considered frequent hitters and excluded from the data set, although it should be noted that this is a very strict definition of a frequent hitter, which is applied here for the purpose of creating a high-quality benchmark set. It is not our intention to flag these compounds as generally undesirable for other applications.

Figure 3. ROC scores by assay. (a) ROC scores obtained with HTSFPs are displayed as boxplots. Each boxplot depicts one test assay and indicates the distribution of scores over five trials. Assays are arranged in increasing order of their median ROC score and designated by their PubChem AID. (b) ROC scores obtained with ECFP4 are visualized as boxplots. The assays are arranged in the same order as in part a.

Hit Expansion. We assessed the utility of HTSFPs for hit expansion, a task routinely performed in drug discovery. For the tests, we selected 33 assays that covered a wide variety of target classes; 16 of these assays were biochemical, and 17 were cell-based (Supplementary Table S1). For each assay, five different hit expansion trials using either ECFP4 features or HTSFPs as descriptors for model building were carried out. For each trial, a ROC score was calculated for the ranked test set; the ROC scores of the five trials were averaged to yield a final performance metric for each test assay. Average ROC scores for the 33 assays varied between 0.64 and 0.95 for HTSFPs and between 0.52 and 0.87 for ECFP4. Performance differences were very clear, with HTSFPs outperforming ECFP4 descriptors for 29 of the 33 assays (Figure 3). HTSFPs yielded an average ROC score of over 0.8 for 23 assays and of over 0.9 for nine assays.

As discussed earlier, all compounds that hit in more than 5% of the assays in which they had been tested were excluded from the data set. Nevertheless, to make sure that the excellent performance of HTSFPs was not due to the enrichment of nonspecific bioactive compounds toward the top of the ranking, we set up a control experiment suggested by Riniker et al.14 For each of the 33 scenarios, we ranked all tested compounds in descending order of their median absolute Z-score in all HTSFP assays (excluding the respective test assay). For the resulting ranked compound list, ROC scores were calculated. This control experiment indicates whether the ability of HTSFPs to enrich active compounds at the top of the ranking results from the detection of a specific bioactivity pattern for the hits or whether the model might simply favor strongly bioactive compounds with high Z-scores across the assay panel.
In the latter case, a much simpler method that brings frequently bioactive compounds toward the top of the ranking could be expected to achieve similar performance. However, our control experiments yielded lower ROC scores for the ranking by median Z-score for all 33 assays (Supplementary Figure S3), hence demonstrating that our HTSFP Bayes models account for more than just the general propensity of a compound to be bioactive. Still, it should be noted that, also for this simple ranking method, ROC scores were clearly distinct from 0.5 and hence much better than random for most assays. This demonstrates that compounds that have shown activity across the assay panel in the past are more likely to show activity in a future assay.21

We then analyzed whether the performance of HTSFP Bayes classifiers depended on the assay format (Figure 4). We noted that better ROC scores were obtained for cell-based assays (36 trials with a ROC score ≥0.9, whereas this score is only achieved for 8 trials using biochemical test assays), but differences in average ROC scores were rather small (0.80 for biochemical assays and 0.84 for cell-based assays). This implies that HTSFPs can be used successfully for expansion around hits from both assay formats.

Figure 4. ROC scores by assay format. Boxplots showing the distributions of ROC scores obtained for biochemical or cell-based test assays.

ROC scores are one of the most frequently used and widely accepted measures to assess model performance. But while the whole ranking of the test set is considered for the calculation of ROC scores, the number of hits in the most highly ranked molecules, i.e., early enrichment of hits among the subset of compounds that is going to be tested experimentally in the next round, is much more relevant for a practical hit expansion application. Therefore, we calculated enrichment factors for the top 1000 molecules. Using HTSFPs for Bayes model building and hit expansion, we found 27.2 times as many hits as expected in a random selection of 1000 molecules (median enrichment factor across all 33 test assays; Table 1). By comparison, ECFP4 descriptors led to a 29.6-fold enrichment (Table 1). Interestingly, when considering enrichment factors and not ROC scores as performance metric, HTSFPs outperformed ECFP4 in only 16 of the 33 assays. This is not unexpected, considering that structural fingerprints have their strength in finding active molecules in the close chemical neighborhood of known active compounds but find it increasingly difficult to predict activity the further the test molecules are away in chemical space.

To investigate the structural diversity of the found hits, we converted all active molecules to their Bemis and Murcko scaffolds. We then determined the number of unique novel scaffolds among the retrieved active molecules in the top 1000. To be counted as a novel chemotype, a scaffold was not allowed to be represented by any active molecule in the training set. We divided this number by the total number of novel scaffolds among all test molecules to determine scaffold recall. For HTSFP, an average scaffold recall of 12.2% was observed, whereas ECFP4 descriptors found on average only 7.9% of the novel active scaffolds among the top 1000 molecules (Figure 5). In terms of scaffold recall, HTSFPs outperformed the structural fingerprints in 24 of the 33 assays. This is in line with previous studies that reported greater structural diversity for hits found with Novartis HTSFPs when compared to chemical descriptors.9,10 As HTSFPs are by design agnostic to chemical structure, they have a high ability to hop in chemical space and recognize active molecules in regions far away from the areas populated by the training set.

Next we analyzed the complementarity of biological and chemical descriptors. For each hit expansion trial, we determined what percentage of the hits retrieved by one method was also discovered by the other. On average, 18.9% of the hits retrieved by HTSFPs and 19.6% of the hits found by ECFP4 were also discovered by the other model (Supplementary Figure S4). Turning this finding on its head, this means that, on average, more than 80% of the hits are unique to each descriptor type. This highlights the fact that chemical and biological descriptors are highly complementary. Therefore, their joint use for hit expansion or other drug discovery applications has been recommended by multiple studies in the past.9,11
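The complementarity measure described above reduces to a set intersection, as this short Python sketch shows. The function name and the toy hit lists are hypothetical and only illustrate how the two directional percentages can differ when the methods retrieve different numbers of hits.

```python
def shared_hit_percentage(hits_a, hits_b):
    """Percentage of the hits in `hits_a` that also occur in `hits_b`."""
    hits_a, hits_b = set(hits_a), set(hits_b)
    if not hits_a:
        raise ValueError("hits_a is empty")
    return 100.0 * len(hits_a & hits_b) / len(hits_a)


# Toy example (hypothetical compound identifiers).
htsfp_hits = {"c1", "c2", "c3", "c4", "c5"}
ecfp4_hits = {"c5", "c6", "c7", "c8"}

htsfp_shared = shared_hit_percentage(htsfp_hits, ecfp4_hits)
ecfp4_shared = shared_hit_percentage(ecfp4_hits, htsfp_hits)
```

In this toy case one compound is found by both methods, which corresponds to 20% of the first hit list and 25% of the second.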


Table 1. Enrichment Factors (a)

               HTSFP                               ECFP4
PubChem AID    average     standard    enrichment  average     standard    enrichment
               hit number  deviation   factor      hit number  deviation   factor
2521           89.40       10.64       43.44       73.00       11.66       35.47
2557           8.60        3.21        4.83        3.80        1.30        2.13
2650           28.60       3.71        33.55       65.80       3.35        77.18
435003         205.20      19.47       66.83       90.80       5.45        29.57
435005         123.60      7.37        37.29       95.00       12.35       28.66
435022         70.00       11.22       27.77       79.80       9.47        31.66
435030         50.80       5.07        27.26       145.40      4.04        78.03
485275         31.60       10.01       8.75        13.80       5.22        3.82
485346         478.20      12.19       21.70       275.20      12.05       12.49
488862         70.40       8.02        40.02       16.00       3.00        9.10
488965         3.00        1.00        5.54        21.80       6.02        40.26
493012         69.20       17.84       28.25       177.20      11.52       72.33
493091         255.40      22.55       39.21       144.60      17.57       22.20
493131         100.20      8.61        11.62       191.20      17.92       22.17
504406         3.00        2.24        13.76       4.80        1.92        22.02
504462         45.40       4.34        10.79       129.80      17.73       30.86
504621         29.20       5.89        24.43       42.60       5.37        35.65
504690         58.80       12.50       12.09       127.20      13.75       26.16
504720         45.00       7.55        20.27       43.20       4.76        19.46
504803         131.60      12.30       62.29       99.40       11.84       47.05
588354         20.20       1.92        118.98      5.20        2.95        30.63
588358         36.00       7.52        8.48        20.40       4.04        4.81
588405         352.60      10.41       40.66       183.60      17.70       21.17
588499         57.60       8.02        54.52       97.40       6.95        92.19
588852         283.80      26.47       27.29       352.40      32.73       33.89
602123         107.80      16.53       36.86       26.00       6.00        8.89
602261         64.20       7.98        11.24       243.40      8.29        42.62
602281         110.80      7.82        25.23       187.40      7.13        42.67
602438         66.00       12.90       8.32        71.80       4.60        9.06
624040         134.80      12.03       34.02       254.20      14.72       64.15
624126         47.00       5.83        11.60       19.00       4.74        4.69
624304         37.40       2.41        10.52       38.80       6.83        10.91
624352         148.20      14.34       44.10       121.40      12.64       36.13

(a) For the 33 test assays subjected to HTSFP- and ECFP4-based hit expansion, the average number of hits in the top 1000 molecules and the standard deviation across the five individual trials are reported. The average number of hits is compared to the number of hits expected in a randomly chosen set of 1000 molecules (assuming the background hit rate in the screening collection) to calculate enrichment factors for all trials.

Lastly, we tried to find out when HTSFP models perform especially well. We hypothesized that it should be easier to learn a model if compound activities in the assay under study are correlated to activities in other assays in the data set. Therefore, we calculated pairwise Pearson correlation coefficients between the assays in our HTSFP. For each assay pair, Z-scores of the compounds that had been tested in both assays were used for correlation calculations. A heatmap showing correlations between all assays in the HTSFP is shown in Supplementary Figure S5. It is apparent that most assays are uncorrelated with each other. However, some clusters of interrelated assays emerge. The biggest cluster comprised 16 assays, of which 6 assays had been in our test set (AIDs: 435003, 485346, 488862, 588354, 588405, 624352). For five of the six assays, high average ROC scores >0.9 were obtained (see Supplementary Table S2 for the relationship between assay correlation and ROC scores for the 33 test assays). However, when looking at these six assays in more detail, we noticed that they were directed against diverse targets but all used a luminescence-based readout. Hence, for these assays, it is not unlikely that a fraction of the hits are false positives that interfere with the assay technology, some of which might be discovered as hits by our HTSFPs. Compounds that interfere with a specific assay technology can lead to a defined activity pattern in the HTSFPs. As a result, the HTSFP Bayes model cannot distinguish between true bioactivity patterns and assay technology patterns. Therefore, to optimize the performance of HTSFPs and avoid the detection of false positives, an assay technology-specific frequent hitter model could be applied and used to exclude those compounds from the training set. However, it is important to understand that this does not invalidate the application of HTSFPs in general. Strikingly, even if no high correlation of a test assay to any other assay in the HTSFP can be found, high ROC scores can be obtained (Supplementary Table S2). This demonstrates one of the chemogenomics principles: bioactivity against a given target can be expressed as a function of bioactivities on nonrelated targets.

Figure 5. Scaffold recall. (a) Displayed is the recall of novel active scaffolds among the top 1000 molecules using HTSFPs. Each boxplot reports results for one test assay. Assays are arranged in increasing order of their median scaffold recall and designated by their PubChem AID. (b) Scaffold recall achieved with ECFP4 descriptors. The assays are arranged in the same order as in part a.

Next, we analyzed whether there were differences in performance among target classes. We used Entrez gene identifiers from the PubChem bioassay metadata and a Novartis target classification scheme to link three test assays to proteases, four test assays to GPCRs, and seven test assays to enzymes in general (excluding proteases). While high ROC scores were obtained for test assays from all target classes (Supplementary Figure S6), GPCRs obtained the highest ROC scores overall. Average ROC scores for GPCRs, enzymes, and proteases were 0.87, 0.82, and 0.82, respectively. Three of the four GPCR test assays were directed against muscarinic acetylcholine receptors (CHRM). For these three assays (AIDs: 588852, 624040, 624126), ROC scores were higher than for the fourth GPCR assay (AID: 2521), which was used to screen for antagonists of the apelin receptor (APJ). The CHRM target family had a strong presence in the HTSFP (nine assays). Perhaps not surprisingly, for the two test assays 588852 and 624040 that aimed at the identification of antagonists for CHRM1 and CHRM5, the most similar assay that we identified based on ligand similarity (Supplementary Table S2) was assay 624125, which screened for CHRM4 antagonists. Similarly, test assay 624126, a screen for positive allosteric modulators (PAMs) of CHRM4, was most similar to an assay for PAMs of CHRM1. This indicates that the presence of assays for closely related targets in the HTSFP is able to boost prediction performance. For proteases, ROC scores vary widely across trials, ranging from 0.69 to 0.92. In this case, good ROC scores are obtained for two assays that aim at the identification of inhibitors of the human cysteine protease ATG4B (AID: 504462) and the human serine protease HTRA1 (AID: 504803). By contrast, a rather weak performance is observed for the E. coli DNA-binding ATP-dependent protease La (AID: 602123). This is intuitive given that the HTSFP contains a multitude of assays for other human cysteine and serine proteases but no further assay for E. coli, let alone an E. coli protease.
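The assay-versus-assay correlation analysis described earlier in this section restricts each comparison to compounds tested in both assays. A minimal Python sketch of that restriction is given below; the function name and toy Z-scores are hypothetical, and returning `None` for pairs with too little shared data is an assumption of this sketch, not a documented choice of the study.

```python
from math import sqrt


def pearson_on_shared(z_a, z_b):
    """Pearson r between two assays using only compounds tested in both.

    `z_a` and `z_b` map compound identifiers to Z-scores. Returns None if
    fewer than two compounds are shared or if one assay has no variance.
    """
    shared = [cid for cid in z_a if cid in z_b]
    if len(shared) < 2:
        return None
    xs = [z_a[cid] for cid in shared]
    ys = [z_b[cid] for cid in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Toy Z-scores (hypothetical): compounds 4 and 5 were each tested in one assay only.
assay_1 = {1: 0.0, 2: 1.0, 3: 2.0, 4: 5.0}
assay_2 = {1: 1.0, 2: 3.0, 3: 5.0, 5: -4.0}
r = pearson_on_shared(assay_1, assay_2)
```

Only compounds 1 to 3 enter the calculation here, so the missing measurements for compounds 4 and 5 do not distort the coefficient.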



CONCLUSIONS In this study we used publicly available HTS data from PubChem’s bioassay repository to build bioactivity profiles for more than 300 000 compounds. We describe in detail all measures that were taken to select assays and compounds and standardize activities and small molecules to yield a rather complete compound-versus-assay Z-score matrix. A representation of bioactivities as Z-scores was preferred over a binary representation (active, inactive) to better describe different levels of bioactivity, i.e., Z-scores enable the characterization of compounds as strong or weak hits and the encoding of different activity directionalities (agonism vs antagonism) for the same assay. This decision is in accordance with the results from Riniker et al.14 who compared binary and continuous bioactivity profiles for a smaller set of PubChem assays and found an improved performance when Z-scores were used as activity representation. In hit expansion experiments on 33 assays we demonstrate that PubChem HTSFPs are able to retrieve structurally diverse molecules with desired bioactivities. PubChem HTSFPs enrich hit expansion sets with active molecules across a broad range of target classes and for different assay formats (cell-based and biochemical). In control experiments, we showed that PubChem HTSFPs capture assay-specific activity patterns and G

DOI: 10.1021/acs.jcim.5b00498 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX


are superior to screening libraries consisting of small molecules with a high propensity for general (promiscuous) bioactivity. As reported for Novartis HTSFPs in previous studies, hits found by PubChem HTSFPs are often complementary to hits found by chemical structure-based methods but display greater chemotype variety, thereby yielding more potential starting points for hit follow-up and optimization. The compound-versus-assay Z-score matrix is made publicly available and can be obtained from the authors through a file sharing system. The Z-scores were calculated based on single-point measurements. For future studies, data sets reporting dose−response measurements will provide an interesting alternative for HTSFP generation. The usage of IC50 values could reduce some of the noise that is inherent to single-concentration experiments. Advances in assay miniaturization have made it possible to run titration experiments on a large scale.22 These assays are labeled as qHTS in the PubChem bioassay database and constitute a prime resource for future experiments with dose−response HTSFPs. By and large, we hope that PubChem bioactivity profiles will be helpful for other researchers who cannot make substantial investments in small molecule profiling experiments but would like to go beyond chemical structure-based methods for hit expansion. An important limitation of all experimentally derived descriptors is that they can only be determined for those compounds that have been assayed, i.e., pairwise similarity calculations can only be carried out for the 300 000 compounds that are annotated with a PubChem HTSFP.
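To make the difference between a Z-score and a binary activity representation concrete, the following minimal sketch standardizes the single-point readouts of one assay and then collapses them to active/inactive flags. The readout values, the per-assay mean/standard-deviation standardization, and the cutoff of 2.0 are illustrative assumptions and are not the exact procedure used to build the PubChem HTSFPs.

```python
from statistics import mean, pstdev


def z_scores(readouts):
    """Standardize one assay's raw readouts to mean 0 and unit standard deviation."""
    mu = mean(readouts)
    sigma = pstdev(readouts)
    if sigma == 0:
        # A constant assay carries no signal; report every compound as neutral.
        return [0.0] * len(readouts)
    return [(x - mu) / sigma for x in readouts]


def binarize(zs, cutoff):
    """Collapse Z-scores to active (1) / inactive (0) using the absolute value."""
    return [1 if abs(z) >= cutoff else 0 for z in zs]


# Hypothetical single-point readouts (percent change vs. control) for 12 compounds:
# ten near-baseline compounds, one strong activator, and one strong inhibitor.
raw = [0, 1, -1, 0, 2, -2, 1, -1, 0, 0, 40, -40]

zs = z_scores(raw)
flags = binarize(zs, cutoff=2.0)  # cutoff chosen for illustration only

for value, z, flag in zip(raw, zs, flags):
    print(f"{value:>4}  z={z:+.2f}  binary={flag}")
```

In this toy assay the activator and the inhibitor receive the same binary flag, whereas their Z-scores keep both the sign and the magnitude of the effect, which is the property the Z-score matrix is meant to preserve.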
To expand around compounds that are not part of the HTSFP matrix, an approach termed Bioturbo Similarity Searching has been introduced in the past (and was evaluated on Novartis HTSFPs).23 This method combines chemical and biological similarity: for a bioactive compound, its structural neighbors in the HTSFP matrix are identified and their HTSFPs are then used as surrogate profiles in hit expansion, thereby enabling scaffold hopping for the compound without HTSFP annotation. With this approach, PubChem HTSFPs could become applicable for a variety of screening campaigns and constitute an integral part of a chemoinformatician’s toolbox.
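The surrogate-profile idea described above can be sketched as follows. This is a simplified illustration and not the published Bioturbo implementation: the structural fingerprints are toy bit sets, Tanimoto similarity and Pearson correlation are assumed as the chemical and biological similarity measures, and the compound identifiers, the profile values, and the function name `surrogate_hit_expansion` are hypothetical.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pearson(u, v):
    """Pearson correlation between two equally long activity profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = sqrt(sum((x - mu) ** 2 for x in u))
    sv = sqrt(sum((y - mv) ** 2 for y in v))
    return cov / (su * sv) if su and sv else 0.0


def surrogate_hit_expansion(query_bits, structures, profiles, k=1):
    """Rank annotated compounds for a query that has no profile of its own.

    Step 1: find the k structurally nearest annotated compounds.
    Step 2: use their profiles as surrogates and rank all other annotated
            compounds by their best profile correlation to any surrogate.
    """
    neighbors = sorted(
        structures,
        key=lambda cid: tanimoto(query_bits, structures[cid]),
        reverse=True,
    )[:k]
    scores = {
        cid: max(pearson(profile, profiles[nb]) for nb in neighbors)
        for cid, profile in profiles.items()
        if cid not in neighbors  # the surrogates themselves are not candidates
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return neighbors, ranked


# Toy library: structural bit sets and four-assay Z-score profiles (all invented).
structures = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 9}, "C": {20, 21, 22}, "D": {30, 31}}
profiles = {
    "A": [5.0, 0.0, -4.0, 0.0],
    "B": [0.0, 1.0, 0.0, -1.0],
    "C": [4.5, 0.2, -3.8, 0.1],
    "D": [-5.0, 0.0, 4.0, 0.0],
}
query_bits = {1, 2, 3, 4, 5}  # bioactive compound without an HTSFP

neighbors, ranked = surrogate_hit_expansion(query_bits, structures, profiles, k=1)
print("surrogates:", neighbors)
print("ranking:", ranked)
```

In this toy example the top-ranked compound shares no structural bits with the query but has a profile that closely tracks the surrogate's, which illustrates how a surrogate profile can lead to a structurally unrelated candidate.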



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.5b00498.

Results for the full data set without exclusion of frequent hitters, a description of the 33 test assays, and graphics illustrating results discussed in the main text (ROC score comparisons, hits found by both ECFP4 and HTSFP descriptors, and assay similarities) (PDF)

Supplementary data set 1 (XLS)

AUTHOR INFORMATION

Corresponding Author

*E-mail: anne.wassermann@pfizer.com.

Present Address

§ K.Y.H.: Northwestern University, Evanston, Illinois 60208, USA.

Author Contributions

∥ The contributions of K.Y.H. and M.M. should be considered equal.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

K.Y.H. was supported by the Education Office of the Novartis Institutes for Biomedical Research. The authors would like to thank Eugen Lounkine, Yuan Wang, and Iain Wallace for helpful discussions, Robert Stanton for thoughtful edits to the manuscript, and the reviewers for their constructive criticism.

REFERENCES

(1) Feng, Y.; Mitchison, T. J.; Bender, A.; Young, D. W.; Tallarico, J. A. Multi-Parameter Phenotypic Profiling: Using Cellular Effects to Characterize Small-Molecule Compounds. Nat. Rev. Drug Discovery 2009, 8, 567−578.

(2) Johannessen, C. M.; Clemons, P. A.; Wagner, B. K. Integrating Phenotypic Small-Molecule Profiling and Human Genetics: The next Phase in Drug Discovery. Trends Genet. 2015, 31, 16−23.

(3) Lamb, J.; Crawford, E. D.; Peck, D.; Modell, J. W.; Blat, I. C.; Wrobel, M. J.; Lerner, J.; Brunet, J.-P.; Subramanian, A.; Ross, K. N.; Reich, M.; Hieronymus, H.; Wei, G.; Armstrong, S. A.; Haggarty, S. J.; Clemons, P. A.; Wei, R.; Carr, S. A.; Lander, E. S.; Golub, T. R. The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science 2006, 313, 1929−1935.

(4) Wawer, M. J.; Li, K.; Gustafsdottir, S. M.; Ljosa, V.; Bodycombe, N. E.; Marton, M. A.; Sokolnicki, K. L.; Bray, M.-A.; Kemp, M. M.; Winchester, E.; Taylor, B.; Grant, G. B.; Hon, C. S.-Y.; Duvall, J. R.; Wilson, J. A.; Bittker, J. A.; Dančík, V.; Narayan, R.; Subramanian, A.; Winckler, W.; Golub, T. R.; Carpenter, A. E.; Shamji, A. F.; Schreiber, S. L.; Clemons, P. A. Toward Performance-Diverse Small-Molecule Libraries for Cell-Based Phenotypic Screening Using Multiplexed High-Dimensional Profiling. Proc. Natl. Acad. Sci. U. S. A. 2014, 111, 10911−10916.

(5) Molnár, E.; Hackler, L.; Jankovics, T.; Ürge, L.; Darvas, F.; Fehér, L. Z.; Lőrincz, Z.; Dormán, G.; Puskás, L. G. Application of Small Molecule Microarrays in Comparative Chemical Proteomics. QSAR Comb. Sci. 2006, 25, 1020−1026.

(6) Perlman, Z. E.; Slack, M. D.; Feng, Y.; Mitchison, T. J.; Wu, L. F.; Altschuler, S. J. Multidimensional Drug Profiling by Automated Microscopy. Science 2004, 306, 1194−1198.

(7) Young, D. W.; Bender, A.; Hoyt, J.; McWhinnie, E.; Chirn, G.-W.; Tao, C. Y.; Tallarico, J. A.; Labow, M.; Jenkins, J. L.; Mitchison, T. J.; Feng, Y. Integrating High-Content Screening and Ligand-Target Prediction to Identify Mechanism of Action. Nat. Chem. Biol. 2008, 4, 59−68.

(8) Wassermann, A. M.; Lounkine, E.; Davies, J. W.; Glick, M.; Camargo, L. M. The Opportunities of Mining Historical and Collective Data in Drug Discovery. Drug Discovery Today 2015, 20, 422−434.

(9) Petrone, P. M.; Simms, B.; Nigsch, F.; Lounkine, E.; Kutchukian, P.; Cornett, A.; Deng, Z.; Davies, J. W.; Jenkins, J. L.; Glick, M. Rethinking Molecular Similarity: Comparing Compounds on the Basis of Biological Activity. ACS Chem. Biol. 2012, 7, 1399−1409.

(10) Wassermann, A. M.; Kutchukian, P. S.; Lounkine, E.; Luethi, T.; Hamon, J.; Bocker, M. T.; Malik, H. A.; Cowan-Jacob, S. W.; Glick, M. Efficient Search of Chemical Space: Navigating from Fragments to Structurally Diverse Chemotypes. J. Med. Chem. 2013, 56, 8879−8891.

(11) Wassermann, A. M.; Lounkine, E.; Urban, L.; Whitebread, S.; Chen, S.; Hughes, K.; Guo, H.; Kutlina, E.; Fekete, A.; Klumpp, M.; Glick, M. A Screening Pattern Recognition Method Finds New and Divergent Targets for Drugs and Natural Products. ACS Chem. Biol. 2014, 9, 1622−1631.

(12) Wang, Y.; Suzek, T.; Zhang, J.; Wang, J.; He, S.; Cheng, T.; Shoemaker, B. A.; Gindulyte, A.; Bryant, S. H. PubChem BioAssay: 2014 Update. Nucleic Acids Res. 2014, 42, D1075−D1082.

(13) Cheng, T.; Li, Q.; Wang, Y.; Bryant, S. H. Identifying Compound-Target Associations by Combining Bioactivity Profile Similarity Search and Public Databases Mining. J. Chem. Inf. Model. 2011, 51, 2440−2448.

(14) Riniker, S.; Wang, Y.; Jenkins, J. L.; Landrum, G. A. Using Information from Historical High-Throughput Screens to Predict Active Compounds. J. Chem. Inf. Model. 2014, 54, 1880−1891.

(15) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Model. 1988, 28, 31−36.

(16) Warr, W. A. Scientific Workflow Systems: Pipeline Pilot and KNIME. J. Comput.-Aided Mol. Des. 2012, 26, 801−804.

(17) Pletnev, I.; Erin, A.; McNaught, A.; Blinov, K.; Tchekhovskoi, D.; Heller, S. InChIKey Collision Resistance: An Experimental Testing. J. Cheminf. 2012, 4, 39.

(18) de Souza, A.; Bittker, J. A.; Lahr, D. L.; Brudz, S.; Chatwin, S.; Oprea, T. I.; Waller, A.; Yang, J. J.; Southall, N.; Guha, R.; Schürer, S. C.; Vempati, U. D.; Southern, M. R.; Dawson, E. S.; Clemons, P. A.; Chung, T. D. Y. An Overview of the Challenges in Designing, Integrating, and Delivering BARD: A Public Chemical-Biology Resource and Query Portal for Multiple Organizations, Locations, and Disciplines. J. Biomol. Screening 2014, 19, 614−627.

(19) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.

(20) Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887−2893.

(21) Wassermann, A. M.; Lounkine, E.; Hoepfner, D.; Le Goff, G.; King, F. J.; Studer, C.; Peltier, J. M.; Grippo, M. L.; Prindle, V.; Tao, J.; Schuffenhauer, A.; Wallace, I. M.; Chen, S.; Krastel, P.; Cobos-Correa, A.; Parker, C. N.; Davies, J. W.; Glick, M. Dark Chemical Matter as a Promising Starting Point for Drug Lead Discovery. Nat. Chem. Biol. 2015, 11, 958.

(22) Inglese, J.; Auld, D. S.; Jadhav, A.; Johnson, R. L.; Simeonov, A.; Yasgar, A.; Zheng, W.; Austin, C. P. Quantitative High-Throughput Screening: A Titration-Based Approach That Efficiently Identifies Biological Activities in Large Chemical Libraries. Proc. Natl. Acad. Sci. U. S. A. 2006, 103, 11473−11478.

(23) Wassermann, A. M.; Lounkine, E.; Glick, M. Bioturbo Similarity Searching: Combining Chemical and Biological Similarity to Discover Structurally Diverse Bioactive Molecules. J. Chem. Inf. Model. 2013, 53, 692−703.